Site Reliability Engineer

Posted 133 days ago

Job Description

SupplyHouse.com, which specializes in HVAC, plumbing, heating, and electrical supplies, is seeking a Site Reliability Engineer to ensure the scalability, reliability, and performance of its infrastructure.

Responsibilities:

Ensure the scalability, reliability, and performance of our infrastructure and applications with a focus on automation, monitoring, and incident response
Design, build, and maintain scalable, reliable systems on GCP (Compute Engine, GKE, Cloud Storage, Cloud SQL)
Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager
Build and maintain observability platforms (monitoring, logging, tracing) using tools such as Stackdriver (Cloud Monitoring), Prometheus, or Grafana
Manage incident response, conduct postmortems, and implement improvements to reduce recurrence
Partner with DevOps and engineering teams to enhance CI/CD pipelines for resilient deployments
Define and monitor SLAs, SLOs, and SLIs to ensure application availability and performance
Implement disaster recovery (DR) and backup strategies across cloud services
Continuously optimize performance, capacity, and cost-efficiency of GCP resources

Requirements:

Bachelor's degree in Computer Science, Engineering, or a related field
3+ years of hands-on experience as a Site Reliability Engineer, DevOps Engineer, Systems Engineer, or Cloud Infrastructure Engineer
Proven track record managing production-grade systems on Google Cloud Platform (GCP) or other cloud providers
Strong understanding of Linux/Unix system administration, networking, and troubleshooting
Experience implementing Infrastructure as Code (IaC) using tools like Terraform, Ansible, or Deployment Manager
Familiarity with containerization and orchestration technologies such as Docker and Kubernetes (GKE)
Experience with monitoring and observability tools (Google Cloud Operations Suite, Prometheus, Grafana, Datadog, ELK)
Experience defining and monitoring SLAs, SLOs, and SLIs to ensure application uptime and performance
Proven ability to handle incident response, conduct postmortems, and drive root cause analysis
Proficiency in at least one scripting language (Python, Bash, or Go) for automation and tooling
Hands-on experience building or managing CI/CD pipelines (Jenkins, GitLab CI, Cloud Build)
Strong background in configuration management and release automation
Knowledge of IAM (Identity and Access Management), network security, and cloud compliance controls
Familiarity with disaster recovery (DR), backups, and high-availability design

Benefits:

Comprehensive and affordable medical, dental, vision, and life insurance options
Competitive Provident Fund contributions
Paid time off and holidays
Mental health support and wellbeing program
Company-provided equipment and a one-time $250 USD work-from-home stipend
$750 USD annual professional development budget
Company rewards and recognition program
And more!
