Site Reliability Engineer, Monitoring and Control Engineering
Posted 8 days ago
Job Description
The Site Reliability Engineer is responsible for NBCU's Distribution Engineering monitoring and control systems, using automation and on-call support to ensure high availability.
Responsibilities:
- Utilize scripting and automation to develop, customize and enhance monitoring/alerting tools for “on-air” environments
- Interact with automated monitoring infrastructure to ensure healthy environments
- Create system dashboards that improve system availability and reliability
- Query data stores to quantify the scope of reported issues
- Create new metrics and identify monitoring deliverables to improve site reliability
- Act as a Level 2 resource: drive and own investigations into Broadcast issues, and report findings to leadership and operations in a timely manner
- Provide 24/7 on-call support on a rotating shift schedule
- Follow up with team members and third-party vendors when issues cannot be resolved, and press vendors for root cause and solutions where possible
- Create comprehensive documentation of encountered issues, explaining the root cause and the steps for effective resolution
- Administer monitoring and control systems within the “on-air” environments
- Develop proof of concept deployments for evaluation of products and architectures
- Utilize modern frameworks and scripting languages to develop products and services for NBCU's IP video distribution environment
Requirements:
- Bachelor’s degree in computer science or a related field
- Experience with IP video and broadcast technologies
- 3-5+ years of experience with monitoring and alerting tools, e.g. Grafana, Splunk, ELK Stack, DataMiner
- Ability to develop end-to-end monitoring dashboards, alerts and reports for enterprise level environments
- 3-5 years of SRE experience in the technology sector supporting and maintaining production-quality software or software-defined infrastructure in a high-traffic cloud environment (AWS preferred)
- Ability to collect data from various systems using COTS APIs
- Experience with scripting languages and tools, e.g. C#, Python, Bash
- Experience with modern frontend technologies such as Vite, React, Node.js, TypeScript
- Experience with configuration management technology, e.g. Ansible, Salt, and/or Chef
- Experience with public cloud platforms such as AWS, GCP or Azure
- Experience with networking and cloud-based network environments
- Experience with containerization (Docker and Kubernetes)
- Experience with CI/CD builds (GitHub Actions), deployment practices, and Infrastructure as Code (Terraform)
- Experience administering Linux and Windows environments
- Ability to use Agile processes for project management, development, and tracking
- Comfortable working in a fast-paced agile environment where requirements change quickly and the team must adapt to moving targets
Benefits:
- Medical, dental, and vision insurance
- 401(k)
- Paid leave
- Tuition reimbursement
- Various other discounts and perks