Senior Manager, DevOps
Posted 2hrs ago
Job Description
The Senior Manager, DevOps leads infrastructure and platform engineering efforts at TrueML, with a focus on cloud architecture and CI/CD standards for machine learning-driven products.
Responsibilities:
- Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
- Lead the design and implementation of self-service internal platforms that reduce developer cognitive load, enabling feature teams to deploy and manage services faster and with minimal friction.
- Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors.
- Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.
- Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
- Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.
- Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.
- Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering, including 5+ years of experience managing engineers.
- Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.
- Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.
- Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus.
- Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
- Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
- Experience acting as an Incident Commander for high-severity outages and fostering a "blameless" post-mortem culture.
- Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.
- Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.