Cloud Reliability Engineer
Posted 128 days ago
Job Description
The Cloud Reliability Engineer is responsible for maintaining the cloud infrastructure behind supply chain software. The role collaborates with other teams to optimize performance by applying automation and reliability principles.
Responsibilities:
- Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.
- Manage and optimize Kubernetes clusters, including deployment, scaling, patching, and upgrades.
- Ensure system availability, scalability, and performance through proactive monitoring and optimization.
- Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.
- Identify opportunities for operational automation to eliminate manual processes (“reduce toil”).
- Build and maintain automated pipelines for deployments, configuration, and remediation.
- Develop self-healing mechanisms to automatically detect and resolve common service issues.
- Design proactive monitoring, alerting, and observability dashboards (e.g., Dynatrace, Datadog).
- Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.
- Monitor, troubleshoot, and resolve infrastructure and application issues.
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.
- Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).
- Strong knowledge of Kubernetes deployment, management, and troubleshooting.
- Solid understanding of observability and monitoring tools (e.g., Dynatrace, Datadog) and incident management platforms.
- Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).
- Strong troubleshooting and analytical skills across infrastructure and applications.
- Experience with incident response, root cause analysis (RCA), and postmortem processes.
- A mindset of continuous improvement, reliability, and self-healing automation.
- Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.
Benefits:
- Competitive salary
- Flexible working hours
- Professional development budget
- Home office setup allowance
- Global team events