HPC Solution Architect

Posted 5 days ago

Job Description

Dell Technologies is seeking a Senior Principal Software Engineer to design and deploy large-scale HPC and AI clusters. The role addresses diverse customer needs in a collaborative technical environment.

Responsibilities:

  • Lead customer architecture & design, translating HPC/AI workload requirements into scalable cluster architectures
  • Deploy and operationalize clusters using Omnia or similar automation
  • Build and maintain provisioning workflows (OpenCHAMI-based or equivalent)
  • Serve as Tier-3 engineering escalation, troubleshooting complex provisioning, scheduling, GPU, networking, and performance issues
  • Contribute to open source and customer enablement through code contributions, documentation, workshops, runbooks, templates, and field readiness materials

Requirements:

  • 8+ years engineering large-scale HPC and distributed infrastructure
  • Strong knowledge of cluster architecture, schedulers, and provisioning workflows
  • Deep experience with RHEL/Rocky/Ubuntu
  • Hands-on cluster deployments using open-source toolchains, Omnia, and OpenCHAMI
  • Production experience with Slurm and/or Kubernetes
  • Proficient with Docker/Podman, OpenTelemetry pipelines, and telemetry instrumentation
  • Strong skills in Ansible, Python, Bash
  • Expertise with Prometheus and Grafana dashboards

Benefits:

  • Health insurance
  • Retirement plans
  • Professional development opportunities
  • Paid time off