Site Reliability Engineer

Posted 2hrs ago

Job Description

Digibee is hiring a Site Reliability Engineer Specialist to implement observability standards and lead incident response. Join a global team innovating in the integration market.

Responsibilities:

Own the technical direction of our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit) — defining instrumentation standards for Java and Node.js services and driving adoption of tracing, metrics, and structured logging.
Establish meaningful SLIs, SLOs, and error budgets, and collaborate with engineering and product teams to use them in real engineering decision-making.
Lead incident response as a senior incident commander, and conduct blameless postmortems with technical depth and real follow-up.
Evolve our on-call program to be humane and sustainable — reducing unnecessary work and alert noise as a first-class engineering priority.
Influence architectural decisions across the platform, going deep where it matters: GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
Mentor SREs and platform engineers, raise the technical bar through design and incident reviews, and help grow the SRE discipline at Digibee.

Requirements:

8+ years in SRE, infrastructure, or platform engineering, with significant time at Specialist or Principal level operating large-scale production systems — this is a mandatory requirement.
Production experience with Kubernetes (preferably GKE), including real fluency in debugging issues under pressure.
Strong observability expertise with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Experience with Dash0 is a strong plus.
Hands-on experience operating stateful services in production, including at least two of the following: PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).
Experience instrumenting and troubleshooting Java services (JVM tuning, GC, thread dumps); familiarity with Node.js runtime characteristics is a plus.
Proven track record of leading incident response and SLO programs that genuinely changed engineering behavior — not dashboards that nobody looks at.
Demonstrated ability to mentor senior engineers and influence technical direction across teams without formal authority.
Communication skills in English and Portuguese (written and verbal), with the ability to collaborate in cross-functional, remote teams.

Benefits:

Flexibility and autonomy at work
Opportunity for growth and real impact
