Senior DevOps Engineer – ML Infrastructure

Posted 107 days ago

Job Description

As a Senior DevOps Engineer on the ML Infrastructure team at Serve Robotics, you will be responsible for designing, building, and maintaining a petabyte-scale data and ML platform.

Responsibilities:

  • Deploy and maintain our ML training orchestration system that operates across multiple platforms.
  • Manage cloud and on-premise environments for large-scale distributed data processing and ML training/inference systems.
  • Automate deployment pipelines, monitoring, and alerting for ML and data services.
  • Collaborate closely with data scientists, ML engineers, and autonomy teams to streamline experimentation and model deployment.
  • Maintain and improve CI/CD systems to support rapid development and testing.
  • Implement best practices for system security, reliability, and observability.
  • Optimize infrastructure costs and ensure efficient resource utilization.
  • Support internal developer productivity through tooling, documentation, and direct assistance.

Requirements:

  • Bachelor’s or Master’s degree in Computer Science or Engineering, or equivalent experience.
  • 5+ years of experience as a DevOps, SRE, or Infrastructure Engineer, preferably supporting ML or data-intensive systems.
  • Strong experience with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes, Docker).
  • Proficiency in infrastructure-as-code tools such as Terraform or Helm.
  • Solid understanding of CI/CD systems (GitLab CI, Jenkins, ArgoCD, etc.).
  • Experience with Python and SQL.
  • Experience with cloud security, IAM (Identity and Access Management), and access control.
  • Experience analyzing and optimizing hardware performance.
  • Experience with GPU cluster management.

Benefits:

  • Equity offered.