Senior SRE/DevOps Engineer

Posted 69 days ago

Job Description

Scentbird is looking for a Site Reliability Engineer to focus on the reliability and performance of its Kubernetes and AWS infrastructure. You will collaborate with engineering teams to improve operational efficiency and service availability.

Responsibilities:

  • Use your shift to prevent incidents from ever happening.
  • Run our infrastructure with AWS, Docker and Kubernetes.
  • Make monitoring and alerting fire on symptoms rather than on outages.
  • Document every action so your findings turn into repeatable actions, and then into automation.
  • Improve the deployment process to make it as boring as possible.
  • Design, build and maintain core infrastructure pieces that allow Scentbird to scale.
  • Debug production issues across services and levels of the stack.
  • Plan the growth of Scentbird’s infrastructure.

Requirements:

  • Strong hands-on Kubernetes experience required (EKS preferred): cluster operations, workload design, networking, upgrades, and performance troubleshooting.
  • 5+ years production application support experience in a high uptime environment
  • 5+ years UNIX administration experience including diagnosis of performance issues, package management, load estimation, kernel tuning, networking configuration, etc.
  • 5+ years hosting experience in a large heavy-traffic environment
  • 3+ years software engineering experience (Java/TypeScript is a plus, but experience with any other programming language is also welcome)
  • Strong understanding of networking fundamentals (VPCs, routing, load balancers, DNS, TCP/IP, TLS) and debugging service-to-service connectivity issues.
  • 4+ years experience working with GitLab CI/CD or GitHub Actions
  • Hands-on AWS experience strongly required
  • Hands-on experience building monitoring/alerting/tracing systems (Grafana/Prometheus/ELK/OpenTelemetry).
  • Database experience is a plus (RDS/Aurora/Postgres/Redis/Elasticsearch), especially around scaling, replication, and performance troubleshooting.
  • Service Mesh experience is a plus
  • Security experience is a plus
  • Ability to work independently on large, complex projects with minimal guidance
  • Excellent troubleshooting and analytical skills with the ability to debug distributed systems under pressure.

Benefits:

  • Competitive base compensation
  • Bonus program
  • Paid Time Off
  • A fun, creative and energetic work environment.