Site Leader

Posted 1 hour ago

Job Description

Weekday's client is hiring a Site Leader to oversee its SRE and infrastructure teams. The role is responsible for building resilient systems and ensuring the availability and performance of the client's platforms.

Responsibilities:

  • Lead and manage SRE and Infrastructure teams, driving operational excellence and fostering a culture of reliability and accountability.
  • Define and execute the overall infrastructure and reliability strategy aligned with business goals.
  • Oversee the design, deployment, and maintenance of scalable, highly available, and secure systems.
  • Establish and monitor SLAs, SLOs, and SLIs, ensuring consistent service performance and uptime.
  • Drive incident management processes, including root cause analysis, postmortems, and continuous improvement initiatives.
  • Collaborate with product and engineering teams to embed reliability and scalability into the development lifecycle.
  • Champion automation, observability, and proactive monitoring to minimize downtime and improve system health.
  • Manage infrastructure costs, capacity planning, and resource optimization.
  • Mentor and develop engineering managers and senior engineers, building a strong leadership pipeline.
  • Ensure adherence to best practices in cloud infrastructure, DevOps, and security compliance.

Requirements:

  • 10–15 years of experience in software engineering, infrastructure, or SRE, including 3–5 years in an Engineering Manager or other leadership role.
  • Proven expertise in Site Reliability Engineering (SRE) principles, including reliability, scalability, and fault tolerance.
  • Strong experience with cloud platforms (such as AWS, GCP, or Azure) and modern infrastructure architectures.
  • Deep understanding of infrastructure as code (Terraform, CloudFormation), CI/CD pipelines, and containerization technologies (Docker, Kubernetes).
  • Demonstrated ability to lead and scale distributed engineering teams.
  • Strong problem-solving skills with a focus on system-level thinking and root cause analysis.
  • Experience with monitoring and observability tools such as Prometheus, Grafana, the ELK stack, or similar.
  • Excellent stakeholder management and communication skills, with the ability to influence cross-functional teams.