Service Reliability & Operations Manager

Posted 70 days ago

Job Description

The Service Reliability & Operations Manager oversees IT services for Kentro, ensuring service stability and operational excellence.

Responsibilities:

  • Lead teams responsible for Application Performance Monitoring (APM), observability, and “eyes on glass” 24/7 monitoring functions.
  • Ensure proactive detection of service degradation and performance anomalies.
  • Drive adoption of modern monitoring tools, dashboards, and alerting frameworks.
  • Oversee the major incident process, ensuring rapid triage, escalation, communication, and resolution.
  • Serve as the escalation point for Critical/High incidents and coordinate cross-functional response.
  • Conduct post-incident reviews and ensure corrective actions are implemented.
  • Manage sustainment of critical integrations, ensuring reliability, version alignment, and lifecycle management.
  • Partner with engineering teams to ensure smooth handoffs from project delivery to steady-state operations.
  • Maintain documentation, runbooks, and operational readiness standards.
  • Track and improve KPIs such as MTTR, service availability, alert fidelity, and incident volume trends.
  • Identify systemic issues and drive continuous improvement initiatives across operations.
  • Ensure alignment with ITIL processes, especially incident, problem, and change management.
  • Lead, mentor, and develop a team of analysts, engineers, and incident managers.
  • Foster a culture of accountability, collaboration, and operational discipline.
  • Build succession plans, training programs, and career pathways for operational staff.
  • Partner with other ESOM teams to ensure end-to-end service reliability.
  • Work closely with the PMO on readiness for new services, innovation pilots, and portfolio changes.
  • Provide clear, concise communication to leadership during incidents and operational reviews.

Requirements:

  • Bachelor's degree in computer science, electronics engineering, or another engineering or technical discipline.
  • 10+ years in IT operations, service reliability, or incident management, including 5+ years managing managers and large teams.
  • Experience overseeing large teams while supporting a Federal client.
  • Proven experience leading multi-site IT operations and large-scale teams (400+ employees).
  • Strong background in ITIL practices, incident management, and customer support operations.
  • History of collaboration and flexibility, including innovative solutions to challenges facing geographically distributed teams.
  • Exceptional leadership, coaching, and interpersonal communication skills.
  • Strong analytical and problem-solving skills with a data-driven mindset.
  • Ability to build and maintain strong client relationships and manage escalations effectively.
  • Experience with APM, observability platforms, enterprise monitoring tools, and KPI reporting.
  • Ability to prioritize work and self-direct with minimal input.
  • Strong messaging skills that build team cohesion, focus, and sustained drive.
  • ITIL Certification (preferred)
  • Experience with end-user technologies and concepts (preferred)
  • Strategic thinking with a focus on operational excellence.
  • Ability to influence and inspire large teams.
  • Results-oriented with a track record of delivering high customer satisfaction.
  • Adaptability and resilience in a fast-paced, multi-client environment.
  • U.S. citizen or Green Card holder.
  • Willing and able to obtain a Public Trust suitability clearance.
  • Must meet updated ID requirements: If you do not currently meet the ID requirements outlined, you must be willing and able to update your current forms of ID in a timely manner to complete the suitability process successfully.