Senior Software Engineer
Posted 10 days ago
Job Description
As a Senior Software Engineer, you will build Upbound Spaces for cloud infrastructure management, work on multi-tenant SaaS environments, and contribute to open-source projects such as Crossplane.
Responsibilities:
- Build and operate Upbound Spaces in production, troubleshoot and resolve issues across multi-tenant SaaS environments, and contribute to Upbound's open-source projects, including Crossplane.
- Take ownership of building features in high demand by Upbound's customers and deliver new functionality that will delight and amaze our users.
- Investigate and debug complex issues in customer environments, including multi-control plane scenarios, resource reconciliation problems, and performance bottlenecks.
- Communicate through thoughtful and thorough design documents for new initiatives and detailed post-incident reviews that drive system improvements.
- Support the full project lifecycle for highly scalable and reliable services running in a cloud environment – discovery, analysis, architecture, design, review, documentation, building, migration, automation, deployment, production-readiness, and ongoing operational support.
- Write and maintain Go code that interfaces with the Kubernetes API, such as operators, controllers, and add-ons, with a focus on observability, debuggability, and operational excellence.
- Deploy, manage, and troubleshoot our Kubernetes services in production, using metrics, logs, and traces to identify and resolve issues quickly.
- Build and maintain operational tooling for debugging customer environments, analyzing control plane health, and automating incident response.
- Author documentation, user guides, runbooks, and blog posts to support and promote new features that you release.
- Support the software release cycle for Spaces self-hosted distributions, including diagnosing issues in customer-managed deployments.
- Participate in the on-call rotation to support Upbound Cloud, responding to incidents and driving them to resolution.
Requirements:
- Have experience operating production cloud services at scale: monitoring, alerting, incident response, post-mortems, and continuous improvement of service reliability.
- Have strong debugging skills across distributed systems, including experience with observability tools (Prometheus, Grafana, OpenTelemetry, distributed tracing) and techniques for diagnosing issues in production environments.
- Have experience building and operating controllers that interact with the Kubernetes API server, including troubleshooting reconciliation loops, managing API rate limits, and optimizing controller performance.
- Are comfortable working directly with customers to understand, reproduce, and resolve complex technical issues in their environments.
- Take responsibility and ownership for solving problems even if they are outside your lane, especially during incidents affecting customer workloads.
- Demonstrate excellence in your work, constantly trying to improve your skills and the operational posture of the systems you build.
- Have empathy for customers and keep them in mind as you build solutions, understanding that reliability and debuggability are features.
- Realize the importance of clear communication and effective collaboration to work as a team, deliver great results, and support customers through technical challenges.
- Help create a safe environment where everyone can contribute, learn from failures, share on-call knowledge, and help each other grow as operators and engineers.
Benefits:
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development