Senior Manager – Incident Response Engineering
Posted 25 days ago
Job Description
The Senior Manager leads Incident Response Engineering for Confluent Cloud and ensures customer-first incident management. The role builds and evolves a team that handles incidents at scale across cloud platforms.
Responsibilities:
- Build and Lead the Team
  - Recruit, hire, and develop a team of senior incident response engineers distributed across AMER and APAC time zones
  - Design sustainable on-call models with follow-the-sun coverage
- Own Incident Response
  - Provide incident command for high-severity and critical customer-impacting incidents, with your team as the primary rotation and you as the senior escalation point
  - Set and enforce standards for how incidents are run: communications cadence, directing engagements with stakeholders, domain expert coordination, handoffs
  - Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership from detection through resolution
- Drive Postmortem Rigor and Customer RCA Quality
  - Own postmortem quality end-to-end: facilitation, root cause analysis, corrective action definition, and ensuring follow-through
  - Manage the Customer Root Cause Analysis (CRCA) program, ensuring timely, technically accurate, clearly written documents that restore customer trust
  - Coordinate upstream technical inputs from engineering teams; synthesize ambiguity into clear, actionable narratives
- Advance Incident Response Through AI and Automation
  - Drive an AI-centric approach to scaling incident operations using intelligent tooling to improve triage speed, documentation quality, and pattern detection without sacrificing rigor
  - Partner with observability, supportability, and resiliency sub-functions with CAR to provide critical inputs into our platform evolution
  - Own and evolve the incident management tooling stack with a bias towards agentic assistance
  - Analyze incident data to identify recurring patterns and feed learnings back into engineering practices
  - When incident load allows, direct your team's capacity toward runbook improvements, automation, and operational hygiene
- Represent Cross-Functionally
  - Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents
  - Brief engineering leadership and executives during active incidents with clarity and composure
  - Be the person engineering teams proactively seek out when operational standards and incident practices need to improve
Requirements:
- 10+ years in SRE, incident management, or reliability engineering, with at least 5 years managing teams in this space
- Proven experience as an incident commander in high-severity, customer-impacting outages at scale. You've personally run incidents that mattered
- Cloud infrastructure experience with at least one of AWS, GCP, or Azure
- Deep understanding of distributed systems failure modes (Kafka/event streaming experience preferred, or demonstrated ability to rapidly master complex systems)
- Strong track record with postmortem facilitation and driving corrective actions to completion
- Excellent written communication with customers regarding root-cause analysis. You are comfortable stating things with conviction to executive audiences
- Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response
- Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment
- Comfort operating with significant autonomy and making high-stakes decisions under pressure
Benefits:
- Offers Equity