Service Reliability & Operations Manager
Posted 70 days ago
Job Description
The Service Reliability & Operations Manager oversees IT services for Kentro, ensuring service stability and operational excellence.
Responsibilities:
- Lead teams responsible for Application Performance Monitoring (APM), observability, and “eyes on glass” 24/7 monitoring functions.
- Ensure proactive detection of service degradation and performance anomalies.
- Drive adoption of modern monitoring tools, dashboards, and alerting frameworks.
- Oversee the major incident process, ensuring rapid triage, escalation, communication, and resolution.
- Serve as the escalation point for Critical/High incidents and coordinate cross-functional response.
- Conduct post-incident reviews and ensure corrective actions are implemented.
- Manage sustainment of critical integrations, ensuring reliability, version alignment, and lifecycle management.
- Partner with engineering teams to ensure smooth handoffs from project delivery to steady-state operations.
- Maintain documentation, runbooks, and operational readiness standards.
- Track and improve KPIs such as MTTR, service availability, alert fidelity, and incident volume trends.
- Identify systemic issues and drive continuous improvement initiatives across operations.
- Ensure alignment with ITIL processes, especially incident, problem, and change management.
- Lead, mentor, and develop a team of analysts, engineers, and incident managers.
- Foster a culture of accountability, collaboration, and operational discipline.
- Build succession plans, training programs, and career pathways for operational staff.
- Partner with other ESOM teams to ensure end-to-end service reliability.
- Work closely with the PMO on readiness for new services, innovation pilots, and portfolio changes.
- Provide clear, concise communication to leadership during incidents and operational reviews.
Requirements:
- Bachelor's degree in computer science, electronics engineering, or another engineering or technical discipline.
- 10+ years in IT operations, service reliability, or incident management, including 5+ years managing managers and large teams.
- Experience overseeing large teams while supporting a Federal client.
- Proven experience leading multi-site IT operations and large-scale teams (400+ employees).
- Strong background in ITIL practices, incident management, and customer support operations.
- History of collaboration and flexibility, including innovative solutions to challenges facing geographically distributed teams.
- Exceptional leadership, coaching, and interpersonal communication skills.
- Strong analytical and problem-solving skills with a data-driven mindset.
- Ability to build and maintain strong client relationships and manage escalations effectively.
- Experience with APM, observability platforms, enterprise monitoring tools, and KPI reporting.
- Ability to prioritize work and self-direct with minimal input.
- Strong messaging skills that build team cohesion, focus, and sustained drive.
- ITIL Certification (preferred)
- Experience with end-user technologies and concepts (preferred)
- Strategic thinking with a focus on operational excellence.
- Ability to influence and inspire large teams.
- Results-oriented with a track record of delivering high customer satisfaction.
- Adaptability and resilience in a fast-paced, multi-client environment.
- U.S. citizen or green card holder.
- Willing and able to obtain a Public Trust suitability clearance.
- Must meet updated ID requirements: If you do not currently meet the ID requirements outlined, you must be willing and able to update your current forms of ID in a timely manner to complete the suitability process successfully.