Site Reliability Engineer

Posted 13 hours ago

Job Description

StarTree is seeking a Site Reliability Engineer to manage and troubleshoot customer-facing distributed systems. The role involves collaborating with engineers to automate operations and improve performance in real-time analytics.

Responsibilities:

  • Leverage various monitoring and alerting services to solve intricate programming problems at scale
  • Manage and tune multiple critical customer-facing Apache Pinot clusters
  • Monitor availability, read/write latencies, and other key telemetry to proactively identify SLO misses and help mitigate issues
  • Build rapport and work closely with customers to mitigate and resolve incidents
  • Execute disaster recovery strategies with minimal downtime
  • Collaborate with other engineers to understand and troubleshoot systems, and use the experience gained to influence the roadmaps of other teams

Requirements:

  • 5+ years of experience as an engineer (SRE, SDET, or development)
  • Experience managing highly available, production-facing distributed systems and in-depth knowledge of Java are a plus
  • Experience with cloud platforms such as AWS, GCP, or Azure
  • Experience with Kubernetes and container orchestration
  • Familiarity with streaming systems, such as Kafka, Pulsar, Flume, Flink, Spark, or similar
  • Knowledge of standard methodologies related to security, performance, and disaster recovery
  • Strong troubleshooting and critical thinking skills

Benefits:

  • Health insurance
  • Flexible work arrangements
  • Professional development