Lead SRE – Observability

Posted 2 days ago

Job Description

The Lead Site Reliability Engineer enhances the observability and telemetry platform for athenahealth's cloud infrastructure, collaborating with engineering teams to improve reliability and operational efficiency.

Responsibilities:

  • Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments
  • Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight
  • Partner with engineering teams to strengthen telemetry collection and overall observability
  • Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency
  • Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency
  • Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability
  • Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines
  • Participate in incident response and on-call processes
  • Help drive operational excellence, root cause analysis, and continuous improvement
  • Mentor engineers on SRE best practices, observability strategy, and scalable systems design
  • Contribute to long-term platform strategy and reliability improvements

Requirements:

  • 7+ years of experience operating and engineering large-scale production infrastructure and distributed systems
  • Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices
  • Proven experience designing and operating observability and telemetry platforms
  • Hands-on experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar
  • Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling
  • Strong automation and software engineering skills using Python, Golang, or Bash
  • Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency
  • Experience operating services in cloud-native environments, including AWS and containerized platforms
  • Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence
  • Ability to communicate effectively across engineering organizations and influence technical decision-making

Benefits:

  • Health and financial benefits
  • Tuition assistance
  • Employee resource groups
  • Collaborative workspaces
  • Flexible work-life balance