Slurm Administration & Systems Architecture (Santa Clara) Job at Midjourney, Santa Clara, CA

  • Midjourney
  • Santa Clara, CA

Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (prolog/epilog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.

Job Tags

Part time
