HPC Solution Architect
Posted 5 days ago
Job Description
Dell Technologies is seeking a Senior Principal Software Engineer to design and deploy large-scale HPC and AI clusters. The role involves working with diverse customer needs in a collaborative technical environment.
Responsibilities:
- Lead customer architecture & design, translating HPC/AI workload requirements into scalable cluster architectures
- Deploy and operationalize clusters using Omnia or similar automation
- Build and maintain provisioning workflows (OpenCHAMI-based or equivalent)
- Serve as Tier-3 engineering escalation, troubleshooting complex provisioning, scheduling, GPU, networking, and performance issues
- Contribute to open source and customer enablement through code contributions, documentation, workshops, runbooks, templates, and field readiness materials
Requirements:
- 8+ years engineering large-scale HPC and distributed infrastructure
- Strong knowledge of cluster architecture, schedulers, and provisioning workflows
- Deep experience with RHEL/Rocky/Ubuntu
- Hands-on cluster deployments using open-source toolchains, Omnia, and OpenCHAMI
- Production experience with Slurm and/or Kubernetes
- Proficient with Docker/Podman, OpenTelemetry pipelines, and telemetry instrumentation
- Strong skills in Ansible, Python, Bash
- Expertise with Prometheus and Grafana dashboards
Benefits:
- Health insurance
- Retirement plans
- Professional development opportunities
- Paid time off