Site Reliability Engineer
Posted 13hrs ago
Job Description
StarTree is seeking a Site Reliability Engineer to manage and troubleshoot customer-facing distributed systems, collaborating with engineers to automate operations and improve performance in real-time analytics.
Responsibilities:
- Leverage various monitoring and alerting services to solve intricate programming problems at scale.
- Manage and tune multiple critical customer-facing Apache Pinot clusters
- Monitor availability, read/write latencies, and other key telemetry to proactively identify SLO misses and help mitigate issues
- Build rapport and work closely with customers to mitigate and resolve incidents
- Execute disaster recovery strategies with minimal downtime
- Collaborate with other engineers to understand and troubleshoot systems, and use the experience gained to influence the roadmaps of other teams
Requirements:
- 5+ years of experience as an engineer (SRE, SDET, or development)
- Experience managing highly available, production-facing distributed systems and in-depth knowledge of Java are a plus
- Experience with cloud platforms such as AWS, GCP, or Azure
- Experience with Kubernetes and container orchestration
- Familiarity with streaming systems, such as Kafka, Pulsar, Flume, Flink, Spark, or similar
- Knowledge of standard methodologies related to security, performance, and disaster recovery
- Strong troubleshooting and critical thinking skills
Benefits:
- Health insurance
- Flexible work arrangements
- Professional development