Reliability Operations Engineer

Posted 55 days ago

Job Description

Manage the operational reliability of robotics and cloud systems at Serve Robotics, handling Tier 2 escalations in collaboration with product engineering and SREs.

Responsibilities:

Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.
Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.
Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.
Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.
Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.
Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.
Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.
Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

Requirements:

Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent hands-on experience.
2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support function.
Experience participating in Tier 1 or Tier 2 investigations, including log review, basic triage, and structured escalation.
Exposure to operational environments supporting distributed or cloud-based systems.
Participation in incident response workflows and/or on-call rotations.
Proficiency with Linux, including navigating systems, reviewing logs, and performing basic diagnostics.
Experience using and contributing to runbooks and operational workflows.
Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
Familiarity with cloud platforms, preferably Google Cloud Platform (GCP).
Ability to follow documented remediation steps, with good judgment around when to escalate.
Understanding of CI/CD pipelines and how application deployments affect runtime behavior.
Experience using Jira or similar ticketing systems.
Clear and effective communicator, especially when providing updates during time-sensitive operational issues.
Calm, organized approach to troubleshooting and prioritization.
Collaborative mindset, working effectively with senior operations engineers, product teams, and SREs.
Strong sense of ownership and accountability for operational responsibilities.

Benefits:

Offers Equity
