Staff Site Reliability Engineer – Incident Management & Reliability

Posted 146 days ago

Job Description

This role is for an expert engineer who manages reliability on multi-cloud streaming platforms such as Confluent Cloud. It focuses on incident management, tooling, and training to strengthen reliability practices.

Responsibilities:

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

Requirements:

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at organizations with 500+ engineers
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Benefits:

Offers Equity

Confluent
