Site Reliability Engineer

Posted 115 days ago

Job Description

The Site Reliability Engineer enhances operational excellence and uptime at the Wormhole Foundation, collaborating with engineering and DevOps teams on critical blockchain services.

Responsibilities:

  • Act as first responder and incident commander during production incidents
  • Lead incident triage, root cause analysis, and retrospective documentation
  • Build detailed incident timelines and preventative runbooks
  • Respond to incidents involving performance issues, CCQ failures or degraded throughput, observability pipeline outages, and core Wormhole products
  • Deliver remediation recommendations and implement approved fixes
  • Improve reliability and uptime across all Wormhole services
  • Strengthen observability, monitoring, and alerting systems
  • Harden infrastructure for security and operational resiliency
  • Enhance deployment workflows and reduce operational friction
  • Lead incident response, analysis, and continuous improvement
  • Support operational tooling used by engineering, DevOps, and validator partners

Requirements:

  • Relevant tertiary qualifications in computer science or a closely related field (bachelor's/master's) and/or at least five years of relevant work experience
  • Established experience as an incident commander coordinating multiple stakeholders across a global team
  • Familiarity with metrics and log analysis tools (e.g., Grafana), incident response tools (e.g., PagerDuty), GitHub administration, and related tools
  • Deep understanding of reliability engineering, observability, and incident response for distributed systems
  • Ability to write and debug code in any of the following: Go, Rust, Java
  • Strong experience operating Grafana, Datadog, or Splunk, and/or Kubernetes, in production environments
  • Experience securing distributed systems and public-facing infrastructure
  • Ability to operate independently, document clearly, and lead during incidents
  • Solid understanding of cloud computing environments (AWS and GCP preferred) and willingness to keep up to date with their changing offerings
  • Excellent and proactive written and verbal communication
  • The ideal candidate will be based in the ET or GMT time zone or be able to work those hours