Site Reliability Engineer – Insurance Platform

Posted 8hrs ago

Job Description

As a Site Reliability Engineer, you will ensure the operational stability and reliability of BJAK’s insurance automation platform, collaborating with engineering teams on system improvements and incident management.

Responsibilities:

  • Own reliability and operational stability of BJAK’s production systems.
  • Design and improve monitoring, alerting, logging and observability across services.
  • Lead incident response, troubleshooting and structured root cause analysis.
  • Improve system resilience through redundancy, failover and recovery strategies.
  • Work with engineers to design systems that are reliable, scalable and operable in production.
  • Improve deployment safety through CI/CD pipelines, release strategies and automation.
  • Reduce recurring incidents by identifying root causes and driving long-term fixes.
  • Manage and optimize cloud infrastructure supporting business-critical workflows.
  • Strengthen operational practices including on-call processes, incident playbooks and SLAs.
  • Continuously improve system uptime, performance and operational maturity.

Requirements:

  • Experience in Site Reliability Engineering, DevOps, platform engineering or infrastructure roles.
  • Strong understanding of distributed systems, cloud infrastructure and production operations.
  • Experience with monitoring, alerting and observability tools.
  • Strong troubleshooting skills for production incidents and system failures.
  • Ability to design for reliability, scalability and fault tolerance.
  • Experience working with CI/CD pipelines and deployment automation.
  • Strong understanding of system performance, capacity planning and risk management.
  • Hands-on ownership mindset during incidents and operational issues.
  • Calm, structured and disciplined approach to production environments.
  • Strong collaboration with engineering teams in fast-paced environments.

Bonus Points:

  • Experience with AWS, GCP, Azure or similar cloud platforms.
  • Experience with Kubernetes, Docker or container orchestration systems.
  • Experience with infrastructure-as-code tools (Terraform, Ansible, etc.).
  • Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
  • Experience with incident management tools and on-call systems.
  • Experience with zero-downtime deployments and progressive delivery strategies.
  • Experience working in fintech, insurance or regulated industries.
  • Experience building reliability frameworks or SRE best practices in scaling systems.
  • Contributions to platform reliability or infrastructure resilience initiatives.

Benefits:

  • Build Reliable Insurance Systems – Support mission-critical automation at scale.
  • High-Impact Engineering – Solve real-world reliability and distributed systems challenges.
  • Global Engineering Team – Work with experienced engineers across multiple countries.
  • Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
  • International Exposure – Build systems used across Southeast Asian markets.
  • Learning & Development Budget – Support continuous technical growth and certifications.
  • High Ownership Environment – Strong autonomy over reliability and operational design.
  • Modern Engineering Culture – Focus on stability, observability and engineering excellence.
  • Competitive Compensation – Attractive salary package based on experience and impact.