Senior Site Reliability Engineer, SRE
Posted 1 day ago
Job Description
Compass UOL is looking for a Senior Site Reliability Engineer to ensure the reliability and performance of its systems. The role collaborates with other teams to automate processes and improve operational excellence.
Responsibilities:
- Ensure the reliability, availability, scalability, and performance of production systems;
- Define, monitor, and evolve SLIs, SLOs, SLAs, and Error Budgets;
- Implement and enhance observability practices, including logs, metrics, tracing, and alerts;
- Participate in the response to critical incidents, conduct root cause analyses (RCAs), and lead blameless post-mortems;
- Automate operational processes to reduce manual work and increase efficiency;
- Collaborate with Development, DevOps, and Architecture teams to prevent systemic failures;
- Plan and validate strategies for high availability, scalability, capacity planning, and disaster recovery;
- Support technical decisions through analysis of reliability, performance, and utilization metrics;
- Contribute to the continuous evolution of a reliability culture and operational excellence.
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, Information Systems, or a related field;
- Proven experience in SRE, IT Operations, Cloud, or Software Engineering;
- Experience with critical, distributed, and high-availability environments;
- Experience with monitoring, incident management, and operational reliability;
- Experience with large-scale AWS environments;
- Advanced knowledge of Docker and Kubernetes;
- Experience with observability, monitoring, and troubleshooting tools;
- Automation skills using Python and Shell scripting;
- Knowledge of resilience concepts, disaster recovery, capacity planning, and security;
- Experience with Chaos Engineering;
- Knowledge of OpenTelemetry and distributed observability.


















