Senior Incident Manager

Posted 1 hour ago

Job Description

The Senior Incident Manager oversees critical incident management for AI cloud infrastructure, ensuring rapid resolution and operational resilience across data center operations and engineering teams.

Responsibilities:

  • Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
  • Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
  • Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
  • Own the incident response lifecycle, including:
      • Assisting Technical Triage
      • Escalation
      • Coordination
      • Resolution
  • Ensure timely and accurate communication with internal stakeholders and leadership.
  • Maintain incident response documentation and operational playbooks.
  • Analyze incidents and identify patterns and trends to improve response and system reliability.
  • Participate in an on-call rotation to respond to, lead, and coordinate incidents.
  • Drive alignment during outages involving multiple infrastructure layers.
  • Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.

Requirements:

  • 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
  • Experience managing incidents in large-scale distributed infrastructure environments
  • Strong understanding of:
      • Data center operations
      • GPU compute clusters
      • Networking and storage infrastructure
      • Cloud or hybrid infrastructure platforms
  • Proven ability to lead high-pressure incident response situations
  • Experience with incident management frameworks (ITIL, SRE, or equivalent)
  • Excellent communication and stakeholder management skills
  • Experience with incident tracking and monitoring tools such as:
      • PagerDuty
      • ServiceNow
      • Jira
      • Datadog
      • Prometheus / Grafana

Benefits:

  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use