AI Platform Engineer – ML Ops
Posted 97 days ago
Job Description
Toku is hiring an AI Platform Engineer to manage MLOps pipelines and deploy AI systems, focusing on cloud-native AI workloads with strong collaboration across engineering teams.
Responsibilities:
- Design, improve, and operate MLOps pipelines for training, deploying, and managing ML models in production.
- Build and maintain CI/CD-style workflows for model packaging, versioning, and deployment across environments.
- Operate and optimise AWS-based infrastructure for AI workloads, including compute, storage, and networking components.
- Manage GPU-enabled workloads, addressing scalability, reliability, and cost-efficiency for high-load AI applications.
- Implement monitoring and alerting for deployed models, focusing on system health, performance, and operational stability.
- Own and evolve shared tooling such as MLflow, Docker-based workflows, and deployment frameworks to improve developer productivity.
- Work closely with infrastructure, SRE, and engineering teams to align AI platform practices with broader system standards.
- Support live AI services by diagnosing deployment, scaling, and infrastructure-related issues impacting AI features.
- Ensure reproducibility, traceability, and governance across the full ML lifecycle, from experimentation to production.
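To give a flavour of the model-versioning and lifecycle duties above: tools such as MLflow provide a model registry where each trained model gets an immutable version and can be promoted through stages. The following is a toy, stdlib-only sketch of that pattern (hypothetical names throughout; this is not the MLflow API):

```python
"""Toy model registry illustrating versioning and stage promotion.
Hypothetical sketch of the pattern MLflow's registry implements;
none of these names come from MLflow itself."""

from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    params: dict   # training hyperparameters, for reproducibility
    metrics: dict  # evaluation results recorded at registration time
    stage: str = "None"  # lifecycle: None -> Staging -> Production


@dataclass
class ModelRegistry:
    # model name -> ordered list of registered versions
    versions: dict = field(default_factory=dict)

    def register(self, name: str, params: dict, metrics: dict) -> ModelVersion:
        """Record a new immutable version with its params and metrics."""
        existing = self.versions.setdefault(name, [])
        mv = ModelVersion(version=len(existing) + 1, params=params, metrics=metrics)
        existing.append(mv)
        return mv

    def promote(self, name: str, version: int, stage: str) -> ModelVersion:
        """Move a version into a stage; archive whatever held it before."""
        for mv in self.versions[name]:
            if mv.stage == stage:
                mv.stage = "Archived"
        target = self.versions[name][version - 1]
        target.stage = stage
        return target


registry = ModelRegistry()
registry.register("churn-model", {"lr": 0.01}, {"auc": 0.81})
registry.register("churn-model", {"lr": 0.005}, {"auc": 0.84})
prod = registry.promote("churn-model", 2, "Production")
```

In a real deployment this state would live in a tracking server backed by durable storage, with each version tied to a specific artifact, git commit, and dataset snapshot so any production model can be traced back to its experiment.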
Requirements:
- Hands-on experience building and operating MLOps pipelines for production ML systems.
- Strong experience with AWS services used for AI workloads, including EC2, ECS, and SageMaker.
- Practical experience with Docker and container-based deployment of ML workloads.
- Experience with MLflow or similar tools for experiment tracking, model versioning, and lifecycle management.
- Experience managing GPU-based workloads and addressing performance and cost challenges at scale.
- Strong understanding of cloud infrastructure concepts as they apply to ML systems.
- Ability to work with Python-based ML codebases to support deployment and lifecycle needs.
- Working familiarity with LLMs, NLP models, and applied ML concepts sufficient to support deployment and monitoring (without owning core model development).
- Proven experience supporting live, production ML systems with real customer impact.
- Ability to work cross-functionally with applied AI engineers, backend engineers, and infra teams.
Benefits:
- Training and Development
- Discretionary Yearly Bonus & Salary Review
- Healthcare Coverage based on location
- 20 days Paid Annual Leave (excluding Bank holidays)