Running Multi-Node Tasks with Enroot (without Pyxis and Slurm)

This webpage is directly generated from j3soon/multi-node-enroot-without-pyxis-and-slurm.

Introduction

Setting up environments for user applications on an HPC cluster is often tedious and diverts attention from the application itself. Containerization is a great way to simplify this process. HPC clusters often use the Slurm workload manager along with containerization tools such as Singularity/Apptainer, Rootless Docker (as an environment module), or Enroot+Pyxis for easier environment management.

Based on my experience working with Slurm and all these containerization options, I personally prefer Slurm with Enroot+Pyxis as it offers the simplest workflow for users familiar with Docker, while also ensuring minimal performance overhead.

The setup instructions are already documented in the official Pyxis repository. The Enroot documentation also contains detailed usage guides for single-node tasks. However, there is no documentation for running multi-node tasks directly with Enroot without Pyxis. Using Enroot without Pyxis may be necessary when you have direct (bare metal) access to multiple Ubuntu nodes and do not want to set up a scheduler or a workload manager like Slurm. In such cases, Enroot alone can serve as a lightweight and effective containerization solution for HPC environments.

This (unofficial) document describes the minimal setup required for running multi-node tasks directly with Enroot, without Pyxis and Slurm. Please note that running multi-node tasks with Enroot alone is more of a hack than a fool-proof solution; the recommended method for multi-node tasks remains Enroot+Pyxis.

Sample Environment

  • GPU Hardware: Two nodes, each with eight H200 NVL GPUs

    The commands below can be easily adapted to an arbitrary number of nodes (or other NVIDIA GPUs).

  • Network Hardware: Each node is equipped with four ConnectX-7 NICs, providing eight InfiniBand connections per node through InfiniBand switches

    Ethernet (RoCE) should also work in theory.

  • OS: Ubuntu 22.04.5 LTS

    Other Linux distributions should also work.

  • Pre-installed Software: NVIDIA Driver, Docker, NVIDIA Container Toolkit, InfiniBand Driver (DOCA-Host or MLNX_OFED (legacy)).

    We are not sure whether Docker and the NVIDIA Container Toolkit are strictly required, but they are installed by default on all our nodes.

  • OpenSM running on the InfiniBand switch, or launched manually on a node.

    Usually it's already running on the InfiniBand switch.

  • All nodes have IP addresses assigned within a private network

    We assume a VPN connection is required to access the nodes.

Prerequisites

Most multi-node clusters will already have basic user accounts, NFS, and SSH configured. If not, you'll need to set these up first.

User Account

Create a user account (with the same username/UID/GID on all nodes) with sudo privileges, with the home directory set to /mnt/home/<username>. If the user already exists, skip this step.

You may want to use tools like LDAP to manage user accounts. Alternatively, you can manually create the user account on all nodes:

# Create user account (run on every node)
USERNAME=<username>
sudo groupadd -g 10001 ${USERNAME}  # useradd -g requires the group to exist
sudo useradd -m -d /mnt/home/${USERNAME} -s /bin/bash -u 10001 -g 10001 -G sudo ${USERNAME}
# Enable password-less sudo (root is required to modify /etc/sudoers)
echo '%sudo ALL=(ALL) NOPASSWD:ALL' | sudo tee -a /etc/sudoers
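To confirm that the UID/GID is consistent across nodes, a quick check such as the following can help (our addition; <IP1> and <IP2> are placeholder node addresses, and we assume you can already reach each node over SSH):

# Should print identical uid/gid/groups on every node
for IP in <IP1> <IP2>; do
    ssh $IP "id ${USERNAME}"
done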

Network File System (NFS)

Set up an NFS server on the head node and mount the shared home directory on all other nodes. If NFS is already configured, ensure the necessary paths are exported and mounted correctly.

On head node:

# Install
sudo apt update
sudo apt install -y nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
# Export
sudo mkdir -p /mnt/home/${USER}
sudo chown -R $(id -u):$(id -g) /mnt/home/${USER}
echo "/mnt/home     *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a

On all other nodes:

# Install
sudo apt install -y nfs-common
# Mount
NFS_SERVER=<HEAD-NODE-IP>
sudo mkdir -p /mnt/home
echo "$NFS_SERVER:/mnt/home /mnt/home nfs defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
mount | grep nfs
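As a quick end-to-end check (our addition), verify that a file created on the head node is visible on the other nodes:

# On head node
touch /mnt/home/${USER}/nfs-test
# On any other node: the file should appear immediately
ls -l /mnt/home/${USER}/nfs-test
# Clean up (on head node)
rm /mnt/home/${USER}/nfs-test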

SSH Configuration

Skip this step if password-less SSH is already configured.

On head node:

# Generate SSH key
ssh-keygen -t ed25519 # and press Enter multiple times to accept the default values
# Copy to shared home directory (will automatically work on all nodes due to shared home directory)
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
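To verify password-less SSH works (our addition; <IP> is any node's address):

# Should print the remote hostname without prompting for a password
ssh -o BatchMode=yes <IP> hostname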

Install Enroot

Download Enroot to the shared directory.

On head node, download Enroot deb files:

cd /mnt/home/${USER}
mkdir -p enroot/deb && cd ~/enroot/deb
arch=$(dpkg --print-architecture)
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_${arch}.deb
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot+caps_3.5.0-1_${arch}.deb # optional

On all nodes, install Enroot:

# run once for all nodes (including head node)
IP=<IP>
ssh $IP 'cd ~/enroot/deb && sudo apt install -y ./*.deb'

You may run the ubuntu example or cuda example to test the installation, but a simple enroot version check is usually sufficient.
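For example, a quick loop over all nodes (our addition; IPs are placeholders) confirms the installed version:

# run once from the head node
for IP in <IP1> <IP2>; do
    ssh $IP enroot version
done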

Multi-node Setup

On all nodes, edit the Enroot config to use a shared container file system (assumes a Bash shell):

# run once for all nodes (including head node)
IP=<IP>
# You may edit /etc/enroot/enroot.conf directly, but the following idempotent commands are recommended for consistency
# Set ENROOT_DATA_PATH to /mnt/home/${USER}/enroot/data
ssh $IP "sudo grep -q '^ENROOT_DATA_PATH[[:space:]]\+/mnt/home/${USER}/enroot/data\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_DATA_PATH[[:space:]]\+\\\${XDG_DATA_HOME}\/enroot\$/a ENROOT_DATA_PATH           /mnt/home/${USER}/enroot/data' /etc/enroot/enroot.conf"
# Set ENROOT_MOUNT_HOME to yes
ssh $IP "sudo grep -q '^ENROOT_MOUNT_HOME[[:space:]]\+yes\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_MOUNT_HOME[[:space:]]\+no\$/a ENROOT_MOUNT_HOME          yes' /etc/enroot/enroot.conf"
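To double-check that both settings took effect on a node (our addition, reusing the IP variable from above):

ssh $IP "grep -E '^ENROOT_(DATA_PATH|MOUNT_HOME)' /etc/enroot/enroot.conf"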

On head node, add Enroot hook for OpenMPI:

sudo tee /etc/enroot/hooks.d/ompi.sh > /dev/null << 'EOF'
#!/bin/bash

echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
EOF
sudo chmod +x /etc/enroot/hooks.d/ompi.sh
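Note that the heredoc delimiter is quoted ('EOF'), so ${USER} and ${ENROOT_ROOTFS##*/} are written literally into the hook file and only expanded when the hook runs at container start. A quick way to confirm the hook is in place and executable (our addition):

ls -l /etc/enroot/hooks.d/ompi.sh   # should show the executable bit
cat /etc/enroot/hooks.d/ompi.sh     # should contain the literal ${USER} and ${ENROOT_ROOTFS##*/}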

Download and Create Container Image

On head node, download NGC HPC-Benchmarks container image to the shared directory:

cd /mnt/home/${USER}/enroot
mkdir -p sqsh && cd sqsh
enroot import docker://nvcr.io#nvidia/hpc-benchmarks:25.04
ls ./nvidia+hpc-benchmarks+25.04.sqsh

On head node, create the container (it will be visible on all nodes thanks to the ENROOT_DATA_PATH setting configured earlier):

cd /mnt/home/${USER}/enroot
mkdir -p data && cd sqsh
enroot create --name hpc-benchmarks-25-04 nvidia+hpc-benchmarks+25.04.sqsh
ls ../data/hpc-benchmarks-25-04
enroot list
# Single node MPI quick test
enroot start hpc-benchmarks-25-04 mpirun hostname

Create a workspace directory, and store the hostfile there (for multi-node tasks, assuming all nodes have 8 GPUs):

cd /mnt/home/${USER}/enroot
mkdir -p workspace && cd workspace
IP_LIST=("<IP1>" "<IP2>")
rm -f hosts.txt  # start fresh to avoid duplicate entries on re-runs
for IP in "${IP_LIST[@]}"; do
    echo "$IP slots=8" >> hosts.txt
done
cat hosts.txt

Run multi-node quick test (assuming 2 nodes with 8 GPUs each):

enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 16 --hostfile /app/hosts.txt hostname
# should see 8 hostnames for each node

Note: The command prefix before the container name (i.e., enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app) must match exactly what is set in the /etc/enroot/hooks.d/ompi.sh hook. Do not modify this part of the command, or the multi-node launch will not work correctly. You can change everything after the container name (e.g., mpirun ...) though. In addition, it is highly recommended to use absolute paths in the command.
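One way to double-check what the hook injects (our addition) is to print the container environment and look for the launch agent variable:

# Should print the full "enroot start --rw --mount ... orted" launch agent string
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 env \
  | grep OMPI_MCA_orte_launch_agent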

Running HPL

Prepare a suitable HPL.dat file for your machine.

enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04
# in the container
cp hpl-linux-x86_64/sample-dat/HPL-H200-8GPUs.dat /app/
cp hpl-linux-x86_64/sample-dat/HPL-H200-16GPUs.dat /app/
# Ctrl+D to exit the container

Test single node HPL:

enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 8 ./hpl.sh \
  --dat /app/HPL-H200-8GPUs.dat

The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.

Test multi-node HPL:

enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04 mpirun -np 16 \
  --hostfile /app/hosts.txt \
  --mca mca_base_env_list "HPL_USE_NVSHMEM=0" \
  ./hpl.sh \
  --dat /app/HPL-H200-16GPUs.dat

The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.

NVSHMEM is disabled here to make fewer assumptions about the network hardware. See: How to run HPL script over Ethernet.

(To be verified) Theoretically, you can enable NVSHMEM (remove HPL_USE_NVSHMEM=0) if you are using InfiniBand and have correctly set up GPUDirect RDMA (DMA-BUF, or nvidia-peermem (legacy) installed with the .run driver installer, or nv_peer_memory (legacy) on GitHub) on all nodes. See more at: NVSHMEM requirements. I'm not sure whether NVSHMEM is supported over Ethernet though...
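(Also unverified) A rough sketch for checking GPUDirect RDMA prerequisites on each node; the exact module names depend on how the driver was installed, and ibstat requires infiniband-diags:

# Check for a peer-memory kernel module (nvidia-peermem or nv_peer_memory, both legacy)
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'
# Check InfiniBand adapters and link state
ibstat | grep -E "^CA|State|Rate"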

All done! Now you can use Enroot to run your multi-node tasks with ease!

FAQ

What is Inside the HPC-Benchmarks Container?

All software other than system-level drivers and kernel modules is included in the container.

NVIDIA HPC-Benchmarks 25.04 includes:

  • Sample files such as HPL dat files
  • HPL, HPL-MxP, HPCG, STREAM
  • NCCL, NVSHMEM, GDR Copy
  • NVIDIA Optimized Frameworks 25.01, including: CUDA, cuBLAS, cuDNN, cuTENSOR, DALI, NCCL, TensorRT, rdma-core, NVIDIA HPC-X (OpenMPI, UCX), Nsight Compute, Nsight Systems, and more (by searching 25.01)

So basically, all software other than that listed in the Sample Environment is included in the container.

For a sanity check:

enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app hpc-benchmarks-25-04
# in the container
ucx_info -v
ompi_info | grep "MPI extensions"
# ...
# Ctrl+D to exit the container

You can see that both UCX and OpenMPI are built with CUDA support, even though you may not have installed UCX, OpenMPI, or even CUDA on the host OS.
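For a more explicit check (our addition), the following commands are commonly used to confirm CUDA support in the UCX and OpenMPI builds (run inside the container):

# UCX: the configure flags should include --with-cuda
ucx_info -v | grep -i cuda
# OpenMPI: should report mpi_built_with_cuda_support:value:true
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value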

How does the Multi-node Setup Work?

To the best of my knowledge, this Enroot multi-node setup (or hack) was first introduced by @3XX0 in this issue.

Aside from the normal single-node Enroot setup, there are four major points in the multi-node setup:

  1. Setting ENROOT_DATA_PATH to an NFS-shared directory in /etc/enroot/enroot.conf.
    This path is used to store the container file system (unpacked by enroot create). Setting it to an NFS-shared directory ensures that the container file system is visible (via enroot list) on all nodes once created. Without this option, users need to manually run enroot create on each node, which is tedious and error-prone. Executing enroot remove will delete the container file system from this path. (Reference)

  2. Setting ENROOT_MOUNT_HOME to yes in /etc/enroot/enroot.conf.
    Mounting the home directory allows the container to access the ~/.ssh folder. This is necessary for MPI (mpirun) to automatically use password-less SSH authentication to launch orted processes on all nodes. (Reference)

  3. Setting OMPI_MCA_orte_launch_agent to enroot start ... orted.
    Setting the OMPI_MCA_orte_launch_agent environment variable is a common trick to make mpirun launch the orted process within an (Enroot/Singularity) container. Basically, it tells mpirun to run enroot start ... orted instead of running orted directly (see the sketch after this list).

  4. Adding an (executable) hook for OpenMPI in /etc/enroot/hooks.d/ompi.sh.
    This hook removes the need to manually set the OMPI_MCA_orte_launch_agent environment variable every time you run a task via enroot start. In our case, without this hook, you would need to run the following every time:

    enroot start -e OMPI_MCA_orte_launch_agent='enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} orted' --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
    
    Adding the following pre-start hook script:
    #!/bin/bash
    
    echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
    
    simplifies the command to:
    enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
    
    which makes life easier. (Reference)
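As a side note on point 3 above, an OMPI_MCA_<name> environment variable is simply OpenMPI's way of setting the MCA parameter <name>; the same value could in principle be passed via mpirun --mca instead of relying on the hook (a sketch for illustration, not the method used in this document):

# Equivalent in spirit to exporting OMPI_MCA_orte_launch_agent before mpirun
enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} \
  mpirun --mca orte_launch_agent \
  "enroot start --rw --mount /mnt/home/${USER}/enroot/workspace:/app ${CONTAINER_NAME} orted" \
  -np 16 --hostfile /app/hosts.txt hostname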

Why not Use the OpenMPI on Host OS?

Running mpirun ... enroot start ... may prevent intra-node optimizations, resulting in worse performance. In addition, using OpenMPI in the Enroot container makes life easier, as we don't even need to install OpenMPI on any node.

Limitations

  • This approach is less robust compared to using Pyxis and Slurm.
  • Requires specifying fixed Enroot flags in the pre-start hook.
  • The examples provided are tailored for a single-user cluster. Some adjustments will be required to support multiple users running Enroot containers concurrently.

    Hopefully I'll have time to come back and fix this later.

Acknowledgments

This note has been made possible through the support of ElsaLab and NVIDIA AI Technology Center (NVAITC).

Thanks to ElsaLab HPC Study Group and especially Kuan-Hsun Tu for environment setup.

And of course, thanks to @3XX0 for sharing this workaround.