Senior Site Reliability Engineer, SRE

Posted 2hrs ago

Job Description

As a Senior Site Reliability Engineer at Fable, you will ensure reliable and scalable infrastructure for AI-driven accessible products, collaborating across teams to improve operational excellence and platform engineering.

Responsibilities:

  • Design, build, and maintain reliable, scalable, and secure infrastructure for Fable’s product services
  • Improve system observability, monitoring, and alerting to ensure high availability and fast incident response
  • Contribute to and evolve SRE practices, including SLIs/SLOs, incident management, and postmortems
  • Support and improve CI/CD pipelines and deployment processes
  • Identify and reduce operational complexity across systems and tooling
  • Work across infrastructure and application layers to diagnose and resolve reliability and performance issues, including making targeted improvements to application code when needed
  • Support infrastructure and platform capabilities required for AI/ML-powered features, including scaling, performance, and reliability considerations
  • Monitor and optimize infrastructure costs across cloud environments
  • Contribute to capacity planning and cost forecasting for infrastructure and services
  • Identify opportunities to improve performance and efficiency at the system level
  • Evaluate and optimize the cost and performance of compute-intensive workloads (e.g., AI/ML services), ensuring efficient resource usage and scalability
  • Work with third-party vendors and tools that support Fable’s infrastructure and operations
  • Help evaluate, select, and manage tools and services to support platform reliability and scalability
  • Support vendor-related troubleshooting and ongoing service improvements
  • Partner with Engineering teams to improve reliability, performance, and operational readiness of new features
  • Partner with application engineering teams to improve service architecture, performance, and observability, and help define best practices for building reliable, scalable systems
  • Act as a point of support and escalation for production issues
  • Collaborate across teams to manage dependencies and ensure smooth system operations
  • Contribute to building strong SRE and operational practices across the organization
  • Share knowledge through documentation, pairing, and technical discussions
  • Help onboard and support more junior team members as the team grows
  • Contribute to improving ways of working within the team and across Engineering

Requirements:

  • 5–8+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Platform Engineering
  • Strong experience with cloud infrastructure (AWS, GCP, or Azure)
  • Experience building internal platforms, tooling, or shared services that improve developer productivity and system reliability
  • Experience designing systems that bridge infrastructure and application layers
  • Ability to work across the stack: comfortable reading, debugging, and making changes to application code (e.g., backend services, APIs) when needed to improve reliability, performance, or observability
  • Experience with at least one backend programming language (e.g., Node.js, Python, Go, Java)
  • Strong experience with monitoring, observability, and alerting tools (e.g., Datadog, Prometheus, Grafana)
  • Solid understanding of CI/CD systems and modern deployment practices
  • Experience managing infrastructure as code (e.g., Terraform, CloudFormation)
  • Experience optimizing system performance and infrastructure costs
  • Familiarity with security and compliance considerations in cloud environments
  • Experience working with third-party vendors and infrastructure tools
  • Familiarity with infrastructure considerations for AI/ML workloads (e.g., high-compute services, data pipelines, or third-party AI platforms) is a strong asset
  • Curiosity about emerging technologies and their impact on infrastructure, reliability, and cost at scale
  • Strong problem-solving skills and ability to navigate complex systems
  • Excellent collaboration and communication skills

Benefits:

  • Stock options
  • Career growth opportunities
  • Professional development support
  • Health and dental coverage