Site Reliability Engineer
Posted 2 hours ago
Job Description
Digibee is hiring a Site Reliability Engineer Specialist to implement observability standards and lead incident response. Join a global team innovating in the integration market.
Responsibilities:
- Own the technical direction of our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit) — defining instrumentation standards for Java and Node.js services and driving adoption of tracing, metrics, and structured logging.
- Establish meaningful SLIs, SLOs, and error budgets, and collaborate with engineering and product teams to use them in real engineering decision-making.
- Lead incident response as a senior commander, and conduct blameless postmortems with technical depth and real follow-up.
- Evolve our on-call program to be humane and sustainable — reducing unnecessary work and alert noise as a first-class engineering priority.
- Influence architectural decisions across the platform, going deep where it matters: GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
- Mentor SREs and platform engineers, raise the technical bar through design and incident reviews, and help grow the SRE discipline at Digibee.
Requirements:
- 8+ years in SRE, infrastructure, or platform engineering, with significant time at Specialist or Principal level operating large-scale production systems — this is a mandatory requirement.
- Production experience with Kubernetes (preferably GKE), including real fluency in debugging issues under pressure.
- Strong observability expertise with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Experience with Dash0 is a strong plus.
- Hands-on experience operating at least two of the following stateful services in production: PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).
- Experience instrumenting and troubleshooting Java services (JVM tuning, GC, thread dumps); familiarity with Node.js runtime characteristics is a plus.
- Proven track record of leading incident response and SLO programs that genuinely changed engineering behavior — not dashboards that nobody looks at.
- Demonstrated ability to mentor senior engineers and influence technical direction across teams without formal authority.
- Communication skills in English and Portuguese (written and verbal), with the ability to collaborate in cross-functional, remote teams.
Benefits:
- Flexibility and autonomy at work
- Opportunity for growth and real impact