Principal Site Reliability Engineer
Posted 2hrs ago
Job Description
The Principal Site Reliability Engineer is responsible for AWS infrastructure and reliability engineering, collaborating across teams to enhance platform performance and security practices.
Responsibilities:
- Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.
- Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.
- Support the GitOps deployment pipeline - define, deploy, and manage applications across clusters using infrastructure-as-code.
- Manage complex networking: VPC design, cross-region connectivity, DNS, and load balancing.
- Lead infrastructure deprecation and migration efforts with minimal disruption.
- Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.
- Lead incident investigation, root cause analysis, and postmortems, driving systemic fixes rather than one-off patches.
- Design and improve automated remediation systems to reduce MTTR.
- Review and provide security-conscious feedback on platform architecture decisions.
- Own cloud IAM governance - roles, policies, and access boundaries across accounts and services.
- Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.
- Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.
- Partner with customer teams to ensure availability for expected utilization.
- Partner with Finance on cloud cost optimization - lifecycle policies, right-sizing, and spend visibility.
- Support GPU and batch workloads in collaboration with simulation and ML engineering teams.
- Improve CI/CD pipelines and automated infrastructure validation.
- Support engineering teams with infra-side debugging, log analysis, and environment configuration.
Requirements:
- 5+ years in SRE, DevOps, or infrastructure engineering roles.
- Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.
- Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.
- Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
- CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).
- Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.
- Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.
- Comfort with Python and Bash for tooling and automation.
- Familiarity with both Linux and Windows environments. Operational familiarity with Windows Server is a meaningful advantage.
- You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.
- You advocate for SRE best practices and can effectively operationalize an informed and principled view on security.
- You take end-to-end ownership of complex, multi-team efforts - from planning through execution and post-change verification.
- You know when to push for a clean solution vs. when to accept a pragmatic one, and you communicate that tradeoff clearly.
Benefits:
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development opportunities