Lead SRE – Observability

Posted 2 days ago

Job Description

The Lead Site Reliability Engineer enhances the observability and telemetry platform for athenahealth's cloud infrastructure and collaborates with engineering teams to improve reliability and operational efficiency.

Responsibilities:

Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments
Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight
Partner with engineering teams to strengthen telemetry collection and overall observability
Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency
Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency
Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability
Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines
Participate in incident response and on-call processes
Help drive operational excellence, root cause analysis, and continuous improvement
Mentor engineers on SRE best practices, observability strategy, and scalable systems design
Contribute to long-term platform strategy and reliability improvements

Requirements:

7+ years of experience operating and engineering large-scale production infrastructure and distributed systems
Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices
Proven experience designing and operating observability and telemetry platforms
Hands-on experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar
Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling
Strong automation and software engineering skills using Python, Golang, or Bash
Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency
Experience operating services in cloud-native environments, including AWS and containerized platforms
Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence
Ability to communicate effectively across engineering organizations and influence technical decision-making

Benefits:

Health and financial benefits
Tuition assistance
Employee resource groups
Collaborative workspaces
Flexible work-life balance
