Principal Software Engineer – Rack Scale Systems Infrastructure

Posted 11hrs ago

Job Description

NVIDIA is seeking a Principal Software Engineer to build software systems for rack-scale infrastructure. The role collaborates across teams to develop dependable, manageable, and programmable solutions for AI-powered applications.

Responsibilities:

  • Define the complete software architecture for rack-scale infrastructure products and services, covering control plane services, infrastructure management, firmware, operating systems, kernel drivers, networking fabrics, accelerator software, and user-mode manageability software.
  • Use Kubernetes and cloud-native primitives as an infrastructure fabric when appropriate, including controllers, operators, reconciliation loops, and open source components that can operate safely at rack and fleet scale.
  • Build open source infrastructure software that can be adopted in different forms, including libraries, services, controllers, operators, and integration APIs for internal deployments and CSP environments.
  • Bridge hardware and software teams across firmware, BMC, BIOS, boot flows, OS images, drivers, networking, NVLink domains, InfiniBand, GPUs, DPUs, CPUs, and system management interfaces.
  • Translate forward-looking infrastructure roadmaps into formal software requirements, architecture specifications, and execution plans that align teams across the organization.
  • Partner directly with hyperscalers, CSPs, enterprise customers, internal component leads, vendors, and business partners to align infrastructure capabilities with real-world deployment and integration needs.
  • Establish reliability, security, validation, and shift-left strategies that reduce risk before hardware reaches production environments.
  • Mentor senior engineers and technical leads, raising the engineering bar for large-scale networked systems, foundational software, and rack-scale control plane development.
  • Make high-quality technical decisions in ambiguous environments, balancing customer needs, schedule, hardware realities, software maintainability, open source adoption, and long-term infrastructure evolution.

Requirements:

  • BS or MS in Computer Engineering, Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 15+ years of proven experience in systems architecture, system software, distributed systems, infrastructure control planes, or infrastructure engineering.
  • Solid architectural knowledge of coordination frameworks, state machines, declarative APIs, reconciliation loops, lifecycle orchestration, failure handling, upgrade and rollback workflows, and distributed systems tradeoffs.
  • Practical coding skills in Go, C++, or Rust, including the ability to write, review, and direct production-quality infrastructure software.
  • Experience with Rust is highly valued.
  • Experience with Kubernetes or similar orchestration systems, especially as a fabric for managing infrastructure, hardware resources, or large-scale infrastructure services.
  • Experience with Linux-based infrastructure software, OS rollout and image management, kernel or driver interactions, firmware lifecycle, and hardware bring-up workflows.
  • Strong understanding of data center networking technologies and protocols, such as Ethernet, InfiniBand, RDMA, and fabric-level manageability.
  • Experience with complex accelerator-based systems, including GPUs, DPUs, FPGAs, custom silicon, or other high-performance computing systems.
  • Expertise in in-band and out-of-band management architectures, including BMCs, Redfish, IPMI, and related system management protocols.
  • Ability to work with security experts to define practical tradeoffs across secure boot, attestation, access control, update safety, serviceability, and ease of operation.
  • Experience designing software intended for open source release, with attention to API stability, modularity, documentation, community usability, and clean separation between shared software and deployment-specific integrations.
  • Experience using AI-assisted development tools responsibly as an engineering multiplier for coding, test generation, debugging, build iteration, and documentation.
  • Demonstrated ability to specify requirements, guide architecture, and manage delivery across multiple engineering teams and organizations.
  • Strong written and verbal communication skills, with the ability to clearly explain complex hardware/software tradeoffs to engineering leaders, customers, partners, and executives.

Benefits:

  • Equity
  • Benefits