Running Multi-Node Tasks with Enroot (without Pyxis and Slurm)¶
This webpage is directly generated from j3soon/multi-node-enroot-without-pyxis-and-slurm.
Introduction¶
Setting up environments for user applications on an HPC cluster is often tedious and diverts attention from the application itself. To tackle this, containerization is a great way to simplify the process. HPC clusters often use the Slurm workload manager along with containerization tools such as Singularity/Apptainer, Rootless Docker (environment module), or Enroot+Pyxis for easier environment management.
Based on my experience working with Slurm and all these containerization options, I personally prefer Slurm with Enroot+Pyxis as it offers the simplest workflow for users familiar with Docker, while also ensuring minimal performance overhead.
The setup instructions are already documented in the official Pyxis repository. The Enroot documentation also contains a detailed usage guide for single-node tasks. However, there is no documentation for running multi-node tasks directly with Enroot without Pyxis. Using Enroot directly without Pyxis may be needed when you have direct (bare-metal) access to multiple Ubuntu nodes and do not want to set up a scheduler or use a workload manager like Slurm. In such cases, Enroot alone can serve as a lightweight and effective containerization solution for HPC environments.
This (unofficial) document describes the minimal setup required for running multi-node tasks directly with Enroot, without Pyxis and Slurm. Please note that running multi-node tasks with Enroot this way is more of a hack than a fool-proof solution; the recommended method for multi-node tasks remains Enroot+Pyxis.
Sample Environment¶
- GPU Hardware: Two nodes, each with eight H200 NVL GPUs
The commands below can be easily adapted to an arbitrary number of nodes (or other NVIDIA GPUs).
- Network Hardware: Each node is equipped with four ConnectX-7 NICs, providing eight InfiniBand connections per node through InfiniBand switches
Ethernet (RoCE) should also work in theory.
- OS: Ubuntu 22.04.5 LTS
Other Linux distributions should also work.
- Pre-installed Software: NVIDIA Driver, Docker, NVIDIA Container Toolkit, InfiniBand Driver (DOCA-Host or MLNX_OFED (legacy)).
We're not sure whether Docker and the NVIDIA Container Toolkit are strictly required, but they are installed by default on all our nodes.
- An OpenSM instance, either running on the InfiniBand switch or launched manually.
Usually it's already running on the InfiniBand switch.
- All nodes have IP addresses assigned within a private network
We assume a VPN connection is required to access the nodes.
Prerequisites¶
Most multi-node clusters will already have basic user accounts, NFS, and SSH configured. If not, you'll need to set these up first.
User Account¶
Create a user account (with the same username/UID/GID) with sudo privileges on all nodes, with the home directory set to /mnt/home/<username>. If the user already exists, skip this step.
You'll want to use tools like LDAP to manage the user account. Alternatively, you can manually create the user account on all nodes:
# Create user account
USERNAME=<username>
# Create a matching group first (useradd -g requires the GID to exist)
sudo groupadd -g 10001 ${USERNAME}
sudo useradd -m -d /mnt/home/${USERNAME} -s /bin/bash -u 10001 -g 10001 -G sudo ${USERNAME}
# Enable password-less sudo
echo '%sudo ALL=(ALL) NOPASSWD:ALL' | sudo tee -a /etc/sudoers
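A quick sanity check: run the following on every node; the UID and GID reported should be identical everywhere.
# Verify the account is consistent across nodes
id ${USERNAME}
getent passwd ${USERNAME}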
Network File System (NFS)¶
Set up an NFS server on the head node and mount the shared home directory on all other nodes. If NFS is already configured, ensure the necessary paths are exported and mounted correctly.
On head node:
# Install
sudo apt update
sudo apt install -y nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
# Export
sudo mkdir -p /mnt/home/${USER}
sudo chown -R $(id -u):$(id -g) /mnt/home/${USER}
echo "/mnt/home *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
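Before moving on, it may be worth confirming that the export is active (showmount is provided by the NFS packages installed above):
# Optional: verify the export on the head node
sudo exportfs -v
showmount -e localhost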
On all other nodes:
# Install
sudo apt install -y nfs-common
# Mount
NFS_SERVER=<HEAD-NODE-IP>
sudo mkdir -p /mnt/home
echo "$NFS_SERVER:/mnt/home /mnt/home nfs defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
mount | grep nfs
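As an extra check, you can confirm the shared home directory is writable from this node (this assumes the same user exists here with matching UID/GID):
# Optional: write a test file over NFS and confirm it appears
touch /mnt/home/${USER}/nfs-test && ls -l /mnt/home/${USER}/nfs-test
rm /mnt/home/${USER}/nfs-test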
SSH Configuration¶
Skip this step if password-less SSH is already configured.
On head node:
# Generate SSH key
ssh-keygen -t ed25519 # and press Enter multiple times to accept the default values
# Add the public key to authorized_keys (this automatically works on all nodes thanks to the shared home directory)
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
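You can then verify password-less SSH with a quick loop (the IPs below are placeholders for your node addresses):
# Each command should print the remote hostname without prompting for a password
for IP in <IP1> <IP2>; do
  ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$IP" hostname
done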
Install Enroot¶
Download Enroot to the shared directory.
On head node, download Enroot deb files:
cd /mnt/home/${USER}
mkdir -p enroot/deb && cd ~/enroot/deb
arch=$(dpkg --print-architecture)
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_${arch}.deb
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot+caps_3.5.0-1_${arch}.deb # optional
On all nodes, install Enroot:
# run once for each node (including the head node)
IP=<IP>
ssh $IP 'cd ~/enroot/deb && sudo apt install -y ./*.deb'
You may run the ubuntu example or cuda example to test the installation, but a simple enroot version is usually sufficient.
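For instance, a quick loop like the following (IPs as placeholders) should print the same version on every node:
# Optional: confirm the same Enroot version is installed on every node
for IP in <IP1> <IP2>; do
  ssh "$IP" enroot version
done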
Multi-node Setup¶
On all nodes, edit Enroot config for shared container file system (assumes Bash shell):
# run once for each node (including the head node)
IP=<IP>
# You may edit /etc/enroot/enroot.conf directly, but the following idempotent commands are recommended for consistency
# Set ENROOT_DATA_PATH to /mnt/home/${USER}/enroot/data
ssh $IP "sudo grep -q '^ENROOT_DATA_PATH[[:space:]]\+/mnt/home/${USER}/enroot/data\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_DATA_PATH[[:space:]]\+\\\${XDG_DATA_HOME}\/enroot\$/a ENROOT_DATA_PATH /mnt/home/${USER}/enroot/data' /etc/enroot/enroot.conf"
# Set ENROOT_MOUNT_HOME to yes
ssh $IP "sudo grep -q '^ENROOT_MOUNT_HOME[[:space:]]\+yes\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_MOUNT_HOME[[:space:]]\+no\$/a ENROOT_MOUNT_HOME yes' /etc/enroot/enroot.conf"
On head node, add Enroot hook for OpenMPI:
sudo tee /etc/enroot/hooks.d/ompi.sh > /dev/null << 'EOF'
#!/bin/bash
echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
EOF
sudo chmod +x /etc/enroot/hooks.d/ompi.sh
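The chmod +x above matters; to double-check that the hook exists and is executable:
# Optional: confirm the hook is in place
ls -l /etc/enroot/hooks.d/ompi.sh
cat /etc/enroot/hooks.d/ompi.sh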
Download and Create Container Image¶
On head node, download NGC HPC-Benchmarks container image to the shared directory:
cd /mnt/home/${USER}/enroot
mkdir -p sqsh && cd sqsh
enroot import docker://nvcr.io#nvidia/hpc-benchmarks:25.04
ls ./nvidia+hpc-benchmarks+25.04.sqsh
On head node, create container (the created container will be visible on all nodes due to the ENROOT_DATA_PATH setting we set earlier):
cd /mnt/home/${USER}/enroot
mkdir -p data && cd sqsh
enroot create --name hpc-benchmarks-25-04 nvidia+hpc-benchmarks+25.04.sqsh
ls ../data/hpc-benchmarks-25-04
enroot list
# Single node MPI quick test
enroot start hpc-benchmarks-25-04 mpirun hostname
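Since the container file system now lives on the NFS share, it should also be listed on the other node(s); a quick check (IP as a placeholder):
# Optional: the container should also show up on other nodes
ssh <IP2> enroot list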
Create a workspace directory, and store the hostfile there (for multi-node tasks, assuming all nodes have 8 GPUs):
cd /mnt/home/${USER}/enroot
mkdir -p workspace && cd workspace
IP_LIST=("<IP1>" "<IP2>")
for IP in "${IP_LIST[@]}"; do
echo "$IP slots=8" >> hosts.txt
done
cat hosts.txt
Run multi-node quick test (assuming 2 nodes with 8 GPUs each):
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 16 --hostfile /app/hosts.txt hostname
# should see 8 hostnames for each node
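To make the output easier to verify, you can count the ranks per node; each hostname should appear 8 times:
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 16 --hostfile /app/hosts.txt hostname | sort | uniq -c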
Note: The command prefix before the container name (i.e., enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app) must match exactly what is set in the /etc/enroot/hooks.d/ompi.sh hook. Do not modify this part of the command, or the multi-node launch will not work correctly. You can change everything after the container name (e.g., mpirun ...), though. In addition, it is highly recommended to use absolute paths in the command.
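For example, a variant that changes only the part after the container name is fine (here, listing the GPUs once per node, assuming the container's OpenMPI supports --map-by ppr:1:node):
# OK: same prefix as the hook; only the payload after the container name changed
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 2 --map-by ppr:1:node --hostfile /app/hosts.txt nvidia-smi -L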
Running HPL¶
Prepare a suitable HPL.dat file for your machine.
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04
# in the container
cp hpl-linux-x86_64/sample-dat/HPL-H200-8GPUs.dat /app/
cp hpl-linux-x86_64/sample-dat/HPL-H200-16GPUs.dat /app/
# Ctrl+D to exit the container
Test single node HPL:
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 8 ./hpl.sh \
--dat /app/HPL-H200-8GPUs.dat
The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.
Test multi-node HPL:
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 16 \
--hostfile /app/hosts.txt \
--mca mca_base_env_list "HPL_USE_NVSHMEM=0" \
./hpl.sh \
--dat /app/HPL-H200-16GPUs.dat
The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.
NVSHMEM is disabled here to make fewer assumptions about the network hardware. See: How to run HPL script over Ethernet.
(To be verified) Theoretically, you can enable NVSHMEM (remove HPL_USE_NVSHMEM=0) if you are using InfiniBand and have correctly set up GPUDirect RDMA (DMA-BUF, or the legacy nvidia-peermem module installed with the .run driver installer, or the legacy nv_peer_memory module on GitHub) on all nodes. See more at: NVSHMEM requirements. I'm not sure whether NVSHMEM is supported over Ethernet, though.
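As a rough, unofficial pre-check on each node, you can look at the InfiniBand port state and whether the legacy peermem module is loaded; DMA-BUF support depends on the kernel and driver versions, so consult the NVSHMEM requirements for the authoritative list:
# Rough per-node checks (not exhaustive)
ibstat | grep -e "State" -e "Link layer"   # ports should be Active with Link layer: InfiniBand
lsmod | grep nvidia_peermem                # present only if the legacy nvidia-peermem module is loaded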
All done! Now you can use Enroot to run your multi-node tasks with ease!
FAQ¶
What is Inside the HPC-Benchmarks Container?¶
All software other than system-level drivers and kernel modules is included in the container.
NVIDIA HPC-Benchmarks 25.04 includes:
- Sample files such as HPL dat files.
- HPL, HPL-MxP, HPCG, STREAM
- NCCL, NVSHMEM, GDR Copy
- NVIDIA Optimized Frameworks 25.01
- including: CUDA, cuBLAS, cuDNN, cuTENSOR, DALI, NCCL, TensorRT, rdma-core, NVIDIA HPC-X (OpenMPI, UCX), Nsight Compute, Nsight Systems, and more (by searching 25.01).
So basically, all software other than what is listed in the Sample Environment is included in the container.
For a sanity check:
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04
# in the container
ucx_info -v
ompi_info | grep "MPI extensions"
# ...
# Ctrl+D to exit the container
You can see that both UCX and OpenMPI are built with CUDA support, even though you may not have installed UCX, OpenMPI, or even CUDA on the host OS.
How does the Multi-node Setup Work?¶
To the best of my knowledge, this Enroot multi-node setup (or hack) was first introduced by @3XX0 in this issue.
Aside from the normal single-node Enroot setup, there are four major points in the multi-node setup:
- Setting ENROOT_DATA_PATH to an NFS shared directory in /etc/enroot/enroot.conf.
This path is used to store the container file system (unpacked by enroot create). Setting it to an NFS shared directory ensures that the container file system is visible (via enroot list) on all nodes once created. Without this option, users need to manually run enroot create on each node, which is tedious and error-prone. Executing enroot remove will delete the container file system from this path. (Reference)
- Setting ENROOT_MOUNT_HOME to yes in /etc/enroot/enroot.conf.
Mounting the home directory allows the container to access the ~/.ssh folder. This is necessary for MPI (mpirun) to automatically use password-less SSH authentication to launch orted processes on all nodes. (Reference)
- Setting OMPI_MCA_orte_launch_agent to enroot start ... orted.
Setting the OMPI_MCA_orte_launch_agent environment variable is a common trick to make mpirun launch the orted process within an Enroot/Singularity container. Basically, it tells mpirun to run enroot start ... orted instead of running orted directly.
- Adding an (executable) hook for OpenMPI in /etc/enroot/hooks.d/ompi.sh.
This hook removes the need to manually set the OMPI_MCA_orte_launch_agent environment variable every time you run a task via enroot start. In our case, without this hook, you would need to run the following every time:
enroot start -e OMPI_MCA_orte_launch_agent='enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} orted' --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
Adding the following pre-start hook script:
#!/bin/bash
echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
simplifies the command to:
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
which makes life easier. (Reference)
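If you are curious, you can confirm what the hook injects by printing the variable from inside a container started with the matching command prefix:
# Should print the 'enroot start ... orted' launch agent configured by the hook
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 bash -c 'echo "$OMPI_MCA_orte_launch_agent"'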
Why not Use the OpenMPI on Host OS?¶
Running mpirun ... enroot start ... may prevent intra-node optimizations, resulting in worse performance. In addition, using OpenMPI inside the Enroot container makes life easier, as we don't even need to install OpenMPI on any node.
Limitations¶
- This approach is less robust compared to using Pyxis and Slurm.
- Requires specifying fixed Enroot flags in the pre-start hook.
- The examples provided are tailored for a single-user cluster. Some adjustments will be required to support multiple users running Enroot containers concurrently.
Hopefully I'll have time to get back to fix this later.
Acknowledgments¶
This note has been made possible through the support of ElsaLab and NVIDIA AI Technology Center (NVAITC).
Thanks to ElsaLab HPC Study Group and especially Kuan-Hsun Tu for environment setup.
And of course, thanks to @3XX0 for sharing this workaround.