Reliability Operations Engineer

Posted 2hrs ago

Job Description

Manage the operational reliability of robotics and cloud systems at Serve Robotics, handling Tier 2 escalations in collaboration with product engineering and SREs.

Responsibilities:

  • Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.
  • Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.
  • Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.
  • Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.
  • Use observability tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.
  • Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.
  • Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.
  • Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
  • Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

Requirements:

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent hands-on experience.
  • 2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support function.
  • Experience participating in Tier 1 or Tier 2 investigations, including log review, basic triage, and structured escalation.
  • Exposure to operational environments supporting distributed or cloud-based systems.
  • Participation in incident response workflows and/or on-call rotations.
  • Proficiency with Linux, including navigating systems, reviewing logs, and performing basic diagnostics.
  • Experience using and contributing to runbooks and operational workflows.
  • Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
  • Familiarity with cloud platforms, preferably Google Cloud Platform (GCP).
  • Ability to follow documented remediation steps, with good judgment around when to escalate.
  • Understanding of CI/CD pipelines and how application deployments affect runtime behavior.
  • Experience using Jira or similar ticketing systems.
  • Clear and effective communicator, especially when providing updates during time-sensitive operational issues.
  • Calm, organized approach to troubleshooting and prioritization.
  • Collaborative mindset, working effectively with senior operations engineers, product teams, and SREs.
  • Strong sense of ownership and accountability for operational responsibilities.

Benefits:

  • Offers Equity