Senior Site Reliability Engineer
Posted 1 day ago
Job Description
Senior Site Reliability Engineer at ZigZag, focused on designing and maintaining scalable cloud infrastructure. Collaborates with engineering teams to improve reliability and performance.
As a Site Reliability Engineer, you'll design, build, and maintain the infrastructure and automation that power our platform. Working closely with software engineering teams and SRE peers, you'll embed reliability, performance, and compliance into the development lifecycle. Your focus will be on scalability, resilience, security, and operational efficiency across all environments.
Responsibilities:
- Design, implement, and continuously improve highly available, scalable, secure, and resilient cloud infrastructure and platform services.
- Define and evolve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and operational metrics to drive measurable reliability outcomes.
- Lead incident response activities, major incident management, root cause analysis, and post-incident reviews focused on systemic improvement.
- Drive reduction of operational toil through automation, standardisation, and self-healing platform capabilities.
- Develop and maintain disaster recovery, backup, failover, and resilience strategies to meet defined RTO and RPO objectives.
- Conduct capacity planning, performance analysis, and proactive optimisation of infrastructure and application environments.
- Champion operational maturity and continuous improvement practices across engineering teams.
- Architect, build, and maintain scalable cloud-native infrastructure primarily within AWS environments.
- Develop and maintain infrastructure-as-code using tools such as Terraform and CloudFormation.
- Build reusable platform components and shared services that improve developer productivity and operational consistency.
- Develop automation tooling and operational frameworks using scripting and programming languages such as Python.
- Analyse system behaviour and performance trends to proactively identify risks and optimisation opportunities.
Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or related infrastructure roles.
- Strong hands-on experience operating production workloads within AWS cloud environments.
- Deep experience with infrastructure-as-code tools such as Terraform and/or CloudFormation.
- Strong experience designing and supporting CI/CD pipelines and modern software delivery practices.
- Strong understanding of distributed systems, microservices architecture, networking, and cloud-native technologies.
- Experience implementing observability and monitoring solutions across complex environments.
- Strong scripting and automation experience using Python, Bash, or similar languages.
- Experience managing production incidents and conducting structured root cause analysis.
- Strong understanding of system reliability, scalability, security, and operational best practices.
- Excellent analytical, troubleshooting, and problem-solving capabilities.
- Strong communication and stakeholder engagement skills.
- Ability to work effectively in fast-paced, agile, and collaborative engineering environments.
Benefits:
- ZigZag is committed to building a diverse, inclusive, and equitable workplace.
- We believe that talent knows no borders, and we welcome individuals from all backgrounds to help us shape the future of work.
- Guided by transparency and agility, we foster an environment where everyone is valued and empowered to thrive.