Senior Site Reliability Engineer

Posted 133 days ago

Job Description

The Senior Site Reliability Engineer develops reliable systems for HavocAI's autonomous maritime technology, collaborating with cloud and engineering teams to enhance service resilience and performance.

Responsibilities:

Design and evolve reliability architecture for distributed and cloud-hosted systems.
Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.
Partner with platform and application teams to design systems for reliability, scalability, and operability.
Identify and mitigate systemic reliability risks across infrastructure and services.
Lead incident response processes including on-call rotations, escalation, and post-incident reviews.
Conduct root cause analysis for complex production incidents and drive long-term improvements.
Improve operational readiness through runbooks, automation, and resilience testing.
Reduce operational toil through tooling, automation, and process improvements.
Design and maintain observability systems for metrics, logging, tracing, and alerting.
Ensure services and data pipelines are observable, debuggable, and performant in production.
Drive performance analysis and tuning across infrastructure and service layers.
Build automation to improve system reliability, deployment safety, and recovery processes.
Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.
Support and improve Kubernetes-based environments and containerized workloads.
Collaborate with security teams to ensure secure and resilient system design.
Participate in disaster recovery planning and testing.
Maintain strong operational practices around access control, secrets management, and change management.

Requirements:

7+ years of experience in SRE, infrastructure, or systems engineering roles
Strong experience operating large-scale distributed production systems
Deep understanding of Linux systems, networking, and distributed systems fundamentals
Hands-on experience with Kubernetes and container orchestration
Programming or scripting experience in Go, Python, or similar languages
Experience designing and operating observability systems for production environments
Proven ability to lead incident response and reliability improvements
Strong communication skills and ability to collaborate across engineering teams
Must be a US citizen
Must be eligible to obtain a government clearance, if required

Benefits:

100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
Life Insurance (Employer Paid)
Ability to participate in the company's 401(k) program (with matching)
Unlimited PTO policy with an enforced 2-week minimum
Equity Package
Work / Home Office Stipend
Global Entry
16-Week Paid Parental Leave
Monthly Health and Wellness Stipend
