Senior Software Engineer, Infrastructure

Posted 2hrs ago

Job Description

As a Senior Software Engineer, Infrastructure, you will drive the stability and reliability of Epic's GCP infrastructure and collaborate with engineering teams to maintain high availability and scalability.

Responsibilities:

Drive the stability and reliability of Epic's GCP infrastructure—setting and tracking SLOs/SLIs, reducing toil, and engineering out recurring sources of instability
Build and operate Epic's GCP infrastructure for high availability, scalability, and cost efficiency
Manage and harden our Docker and GKE container platform, including workload scheduling, autoscaling, networking, and graceful failure handling
Maintain and improve CI/CD pipelines that enable fast, safe, low-risk delivery across engineering teams
Own and evolve the observability stack—metrics, logs, traces, dashboards, and alerts—so that signals are actionable, noise is low, and on-call has the context to resolve issues quickly
Write and maintain Terraform to codify infrastructure across the organization, with a focus on consistency, change safety, and reproducibility
Contribute to capacity planning, cost optimization, and architectural reviews, with reliability as a first-class consideration
Champion platform security best practices, including secrets management, IAM policies, and network segmentation
Support compliance-aware infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we mature our SOC 2 and student-data compliance programs
Partner with data engineering to operate the orchestration platform and supporting infrastructure—deployment, scaling, reliability, and observability
Collaborate with backend and data engineers to troubleshoot service and platform issues
Lead by example in a frequent on-call rotation; drive incident response, blameless post-mortems, and the follow-through that turns one-time outages into systemic, lasting reliability improvements
Provide guidance to developers on infrastructure concerns and best practices

Requirements:

Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
5+ years of experience in infrastructure, platform, DevOps, or a related engineering role
Hands-on experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and related services)
Experience with Docker and Kubernetes (GKE)—containerizing workloads, deploying to GKE, Helm, and cluster fundamentals
Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or similar)
Experience with an observability platform such as New Relic (metrics, logging, alerting, dashboards)
Proficiency in Terraform for managing infrastructure as code
Scripting/programming skills in Python, Bash, or similar
Comfort participating in a frequent production on-call rotation
Track record of measurably improving reliability of production systems—e.g., defining SLOs, reducing incident frequency or MTTR, eliminating recurring failure modes
Strong problem-solving skills, sense of ownership, and ability to work effectively in evolving systems
Fluency in English for daily collaboration and technical documentation
Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners

Epic Kids
