Senior Site Reliability Engineer/DevOps
RCS TECH • Mexico City, Mexico
Role Description
What You’ll Do
Reliability & Operations
- Own availability, latency, and scalability across SaaS and AI systems
- Define and enforce SLOs, SLIs, and error budgets
- Participate in a global on-call rotation (~1 week every 4 weeks)
- Lead incident response and drive blameless postmortems with systemic fixes
Platform & Infrastructure
- Architect and operate on-premise and multi-region, multi-cloud environments
- Manage large-scale Kubernetes workloads
- Build and evolve infrastructure using Terraform and Ansible
- Improve system resilience, fault isolation, and capacity planning
AI/ML & Automation
- Build and scale agentic AI systems for triage, anomaly detection, and self-healing
- Ensure reliability of model serving infrastructure
- Operate, optimize, and scale distributed systems
What You Bring
- 5+ years in SRE, Production Engineering, or Platform Engineering
- Strong experience with cloud providers (AWS/GCP/OCI), Kubernetes, and IaC (Ter...