Site Reliability Engineer

Posted 62 days ago

Job Description

Partly is looking for a Site Reliability Engineer to ensure the stability and scalability of its cloud infrastructure, collaborating with cross-functional teams to optimise costs and troubleshoot systems.

Responsibilities:

  • Reliability Engineering: Ensure the stability, scalability, and security of our cloud infrastructure and of the Partly and third-party applications running in our Kubernetes-powered clusters. Leverage Infrastructure-as-Code and automation (Terraform for GCP, GitOps with ArgoCD, custom scripts in Python/Bash, etc.) to deploy and manage workloads and resources in a repeatable, automated way.
  • Cost Optimisation: Monitor and optimise costs across our cloud and on-prem infrastructure, ensuring we get maximum value from our investments. Make recommendations for resource allocation or architecture changes to improve cost-efficiency without sacrificing reliability or performance.
  • Cross-Functional Collaboration: Work closely with developers, data engineers, and leadership to plan infrastructure needs and improvements. Provide tooling, guidance and training to the engineering team on SRE practices, and collaborate during software delivery to ensure smooth integrations from code to production.
  • Software Engineering: Make sure our software meets high production readiness standards. When you see a problem or an opportunity to improve, you drive the solution.
  • Troubleshooting: Participate in incident resolution and give developers a helping hand in debugging applications, networks, databases, and compute systems.

Requirements:

  • Software Engineering: You excel at developing and maintaining large, established software systems, beyond simple scripts and utilities. You know what makes software maintainable, and you write robust code.
  • Computer science fundamentals: A firm grounding in data structures, concurrency, architecture, APIs, testing, and design patterns.
  • System engineering fundamentals: You most likely know how to deploy and use a memory or stack-sampling profiler, how to locate excessive lock contention, how to identify network issues, etc.
  • SRE Expertise: Hands-on experience with modern SRE practices and tooling – for example, containerization (Docker/Kubernetes), infrastructure-as-code (Terraform), and GitOps workflows (ArgoCD or equivalent). You have designed, built, and maintained scalable infrastructure and CI/CD systems.
  • Cloud & Systems Knowledge: Deep familiarity with at least one major cloud platform and the Linux operating system. You can tune servers, manage databases/storage, and wrangle Kubernetes clusters.
  • Ownership & Leadership: High degree of ownership and bias for action, with a proactive approach to solving problems. You take initiative and don’t wait to be told what to do. You have demonstrated leadership through mentoring junior engineers or leading small teams/projects, even if not formally a manager. We’re seeking a track record of ownership over critical systems and successful delivery of complex projects.
  • Collaboration & Communication: Excellent communication skills (written and verbal) and a collaborative attitude. You can work across teams and departments – from explaining technical issues to non-technical colleagues, to coordinating with engineers on deployments. You value teamwork and knowledge sharing.
  • Adaptability: Willingness to wear multiple hats and adapt to evolving needs. In a fast-growing startup environment, requirements can change – you’re excited by the chance to learn new skills, take on new challenges, and grow with the role.
  • Bonus Points: Experience in a high-growth startup environment, which means you’re used to the pace and ambiguity. Any prior experience maintaining security compliance and certifications in a company is a plus. If you have used specific tools we use (GCP, ArgoCD, GitLab CI, Kafka, etc.), that’s great – if not, you can learn quickly. Significant experience running production workloads on Apache Cassandra and/or Postgres databases is also a plus, as is having developed software in Rust and being able to mentor other developers on Rust best practices.

Benefits:

  • Take time when you need it.
  • Zero-hierarchy & no ‘new joiner projects’.
  • Dedicated Employee Experience Team.
  • Competitive base salary + equity.
  • Parental leave and flexible return to work.
  • Flexible working hours.
  • Focus Days & Ergonomic workspace.
  • Generous relocation allowance.
  • Brand new, architecturally designed offices in Christchurch CBD and on Auckland’s Karangahape Road.
  • Team connection.
  • Sustainable Workplace.
  • Regular L&D opportunities.
  • Quarterly full team weeks.
  • Annual global Offsite.