Staff Site Reliability Engineer – Incident Management & Reliability

Posted 101ds ago

Employment Information

Education
Salary
Experience
Job Type

Report this job

Job expired or something wrong with this job?

Job Description

Expert engineer managing reliability in multi-cloud streaming platforms like Confluent Cloud. Focused on incident management, tooling, and training for enhanced reliability practices.

Responsibilities:

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide

Requirements:

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Benefits:

  • Offers Equity