Platform/DevOps Engineer
Posted 98 days ago
Job Description
Own infrastructure reliability and cost optimization for a production platform serving diverse customers. This role emphasizes building resilient, secure, and cost-efficient cloud infrastructure.
Responsibilities:
- Maintain a 99.5% uptime SLA across all production services and customer environments.
- Design and maintain multi-region deployments to support geographic redundancy.
- Implement automated failover mechanisms for databases, load balancers, and critical services.
- Build and manage disaster recovery strategies, including automated backups and point-in-time recovery.
- Lead incident detection, response, and postmortems, meeting defined SLAs for P0 issues.
- Develop real-time observability dashboards for uptime, latency, error rates, and system health.
- Monitor application and infrastructure performance metrics across customers.
- Implement alerting, on-call rotations, escalation policies, and PagerDuty integrations.
- Manage log aggregation and retention using SIEM platforms such as Splunk or Sumo Logic.
- Support SOC 2 Type II preparation through security controls, monitoring, and documentation.
- Implement vulnerability scanning and DLP controls, and coordinate penetration testing.
- Optimize cloud infrastructure costs through right-sizing, auto-scaling, and storage lifecycle policies.
- Track and report infrastructure and API costs per customer, driving FinOps best practices.
- Build automated runbooks and self-healing workflows for common incidents.
Requirements:
- Strong experience as a Site Reliability Engineer, DevOps Engineer, or Platform Engineer.
- Deep expertise in AWS cloud architecture (ECS, EKS, RDS, Lambda, S3, CloudFront).
- Proven experience with Infrastructure as Code using Terraform or CloudFormation.
- Hands-on production experience with Kubernetes and container orchestration.
- Strong knowledge of observability and monitoring tools (Datadog, New Relic, Prometheus, Grafana).
- Experience managing on-call rotations, incident response, and post-incident reviews.
- Solid understanding of security practices including SIEM, vulnerability scanning, and SOC 2 compliance.
- Demonstrated experience in cloud cost optimization and FinOps practices.
- Ability to operate independently and prioritize reliability in high-availability environments.
Benefits:
- Health insurance
- Flexible work arrangements
- Professional development opportunities