Senior Platform Engineer – DevOps, Infrastructure and Platform

Posted 14hrs ago

Senior Platform Engineer at Ozmap responsible for AWS and Linux environments, troubleshooting, and building CI/CD pipelines for continuous delivery.

Design, operate and evolve AWS (EC2) and on-premises environments with containers (Docker), ensuring availability, security and scalability;
Operate and administer Linux production environments (systemd, kernel/network tuning, I/O, process troubleshooting);
Build and evolve CI/CD pipelines from scratch, including quality and security gates;
Develop end-to-end observability (instrumentation, exporters, PromQL, SLI/SLO, alerts);
Lead advanced troubleshooting, root cause analysis and blameless post-mortems — driving structural change afterwards, not just producing a report;
Implement automation using Infrastructure as Code;
Analyze and optimize cloud costs: rightsizing, usage analysis and proposing data-driven alternatives;
Act as a technical reference for developers and engineers, influencing architecture without relying on formal authority.

Required: production experience operating core AWS primitives (~4+ years): EC2, VPC/networking, IAM and security, covering both operation and technical decision-making;
Linux and networking (~4+ years): server administration and production troubleshooting (disk full, OOM killer, network diagnostics, processes, memory and I/O);
CI/CD built from scratch (~3+ years): pipelines created and evolved by you (GitHub Actions, Jenkins, self-hosted runners, secrets, caching, gates);
End-to-end open-source observability (~2+ years): Prometheus, Grafana, Loki, VictoriaMetrics or equivalents, configured and operated by you, not just used; plus OpenTelemetry, including instrumentation, exporters, PromQL and SLI/SLO definition;
Operation beneath managed layers: concrete experience with nginx/HAProxy/Envoy and the Linux underneath them, and with leading the resolution of critical incidents;
Docker in production (~3+ years): real operation of containers in critical environments — volumes, networking, resource management, graceful shutdown of services;
High autonomy: takes an ambiguous problem ("our observability is weak") and delivers a solution end-to-end;
Ownership and proactivity: anticipates problems before they become incidents;
Clear communication and technical influence, connecting development, infrastructure and business teams;
Conducts post-mortems focused on root cause, organizational learning and continuous improvement, without a blame culture;
Maturity to self-manage while working remotely.