SRE Lead
Posted 65 days ago
Job Description
The SRE Lead is responsible for production reliability, overseeing blockchain infrastructure and team operations. You will lead the SRE team, manage incidents, and drive operational excellence for multi-region platforms.
Responsibilities:
- Lead and grow the SRE team: hiring, onboarding, 1:1s, performance reviews, and career development.
- Own SRE operating cadence: prioritization, planning, execution, and visibility of reliability work.
- Maintain high standards for production readiness: runbooks, operational checklists, change management, and quality gates.
- Own production reliability end-to-end across gateways, clusters, and blockchain node fleets.
- Define and evolve SLIs/SLOs for uptime, response time, RPS, and time-to-resolve; partner with engineering teams to meet targets.
- Own incident management standards: alerting strategy, escalation, incident coordination, and communications.
- Run and improve postmortems: ensure follow-ups are executed and reliability debt is reduced over time.
- Lead capacity planning and performance work across regions and chains; balance reliability, speed, and cost.
- Lead design reviews and set engineering standards for reliability, scalability, and operational excellence.
- Drive architecture decisions across Nomad + Kubernetes environments, gateways, and the observability stack.
- Build and evolve internal tooling that improves reliability and operational efficiency (automation, health systems, diagnostics, self-service).
Requirements:
- 3+ years in SRE / infrastructure / production engineering, including 1+ year leading people
- Strong Linux, networking, and production incident debugging skills
- Experience running and scaling distributed, multi-region, high-load systems
- Hands-on with orchestration (Nomad and/or Kubernetes) and modern gateways/proxies
- Solid observability practices (metrics, logs, traces, alerting, incident response)
- Experience using AI agents to improve operational efficiency and reliability automation
- Strong communication and ability to lead technical decisions end to end
Nice to have:
- Web3 / RPC infrastructure and blockchain node operations
- HashiCorp stack (Nomad, Consul, Vault), Prometheus ecosystem
- Terraform / IaC, capacity & cost modeling, DDoS and abuse protection
- Building internal platforms: self-service tools, runbooks, reliability automation
Benefits:
- 20 days of annual leave, plus 12 additional days off for holidays or personal days.
- Well-being programs to support your health and balance.
- Coworking space compensation for a productive work environment.
- Paid sick leave to ensure you can rest when needed.
- A company that invests in your growth, with personalized roadmaps to guide your professional development.
- An actively growing company with great opportunities for both horizontal and vertical career development.
- Opportunity to shape the initiatives you’re working on and make a real impact.