Staff ML Ops Engineer
Posted 64 days ago
Job Description
As a Backend & Infrastructure Engineer, you will shape AI development at Albert, a materials innovation platform, by designing Kubernetes-based, high-performance infrastructure for AI workloads.
Responsibilities:
- Design, deploy, and maintain Kubernetes infrastructure supporting AI/ML workloads
- Manage containerized services, autoscaling, networking, and resource optimization
- Design and build high-performance Python APIs and services using FastAPI or similar frameworks
- Architect backend systems for scalability, reliability, and low latency
- Build integrations between AI/ML systems and the broader Albert platform
- Build and operate distributed systems that handle compute-intensive and high-throughput workloads
- Design for fault tolerance, graceful degradation, and horizontal scalability
- Implement async workflows, job queues, and task orchestration as needed
- Architect and maintain data pipelines and storage systems supporting AI/ML workflows
- Implement observability including logging, metrics, tracing, and alerting
- Own system reliability—troubleshoot issues, conduct post-mortems, and continuously improve
- Design CI/CD pipelines and promote automation best practices
- Partner closely with ML engineers to understand requirements and deliver production-ready infrastructure
- Translate ML prototypes and research code into scalable, maintainable systems
Requirements:
- A Bachelor's degree in Computer Science or a related field with 7+ years of industry experience in software engineering, or a Master's or PhD with 5+ years
- Experience supporting AI/ML teams or deploying ML systems in production
- Experience with GPU workloads and scheduling
- Advanced proficiency in Python including async programming and performance optimization
- Deep experience with Kubernetes—cluster management, networking, autoscaling, and troubleshooting
- Strong background in distributed systems and microservices architecture
- Experience with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code
- Proficiency in REST API development using FastAPI, Flask, or similar
- Experience with containerization and CI/CD pipelines
- Track record of operating production systems at scale
Benefits:
- Health insurance
- Flexible working hours
- Professional development opportunities