This webpage is directly generated from the README of j3soon/runai-isaac. Please refer to the repository for additional information such as the Run:ai scripts.
A comprehensive guide for (1) setting up Run:ai with helper scripts, (2) running PyTorch, Isaac Sim, Isaac Lab, Cosmos, CUDA, and more workloads on Run:ai, and (3) using SSH, VNC, Jupyter Lab, VSCode, TensorBoard, Nsight Systems, Nsight Compute, and more tools on Run:ai.
For running Isaac Sim workloads on Omniverse Farm, please refer to j3soon/omni-farm-isaac. These two workload managers can be used together. Adding a Run:ai project named ov-farm will allow Run:ai to act as a scheduler for Omniverse Farm.
For new users, we strongly recommend reading this entire guide and following the instructions step by step. You can skip optional sections and ignore links unless needed.
In the past, skipping this guide has led to serious issues, including code and data loss when containers are terminated.
Only skip the guide if you are fully confident in what you're doing. Proceed at your own risk.
Note that this section is optional if you plan to use the Run:ai Dashboard directly.
However, you'll need to keep in mind the following secrets and use them accordingly.
These 4 scripts are just wrappers around the openvpn3 command-line tool. See the official documentation for more details.
If you need to connect multiple machines to the VPN simultaneously, avoid using the same VPN profile. Doing so may cause one machine to disconnect when another connects. Consider asking the cluster admin to generate separate VPN profiles for each of your machines.
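For reference, the underlying openvpn3 workflow the scripts wrap looks roughly like this (the profile filename is a placeholder):

```sh
# Import the VPN profile (one-time setup; filename is a placeholder)
openvpn3 config-import --config client.ovpn
# Start a VPN session
openvpn3 session-start --config client.ovpn
# List active sessions
openvpn3 sessions-list
# Disconnect the session
openvpn3 session-manage --config client.ovpn --disconnect
```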
We strongly recommend following the instructions at least once to understand the cluster's logic. For example, any data stored outside the persistent NFS volume will be deleted when the container is terminated.
Pre-built Docker images for Isaac Sim, Isaac Lab, and other applications are described at the end of this document. However, we recommend following the instructions below at least once to familiarize yourself with the workflow.
Note that this step is optional if you are using our pre-built Docker images.
It is highly recommended to build your custom Docker images in a Linux environment (with NVIDIA Driver, Docker, and NVIDIA Container Toolkit installed). Building on Windows is strongly discouraged for beginners unless you know exactly what you are doing.
In this example, dependencies are not installed in the Dockerfile. In practice, however, you will want to select a suitable base image and pre-install all dependencies in the Dockerfile (e.g., pip install -r requirements.txt) to avoid reinstalling them every time a container is launched. You may also want to delete the .dockerignore file. In addition, always copy the run.sh file and the omnicli directory directly to the root directory (/) without any modifications, rather than placing them in other subdirectories. Failing to do so will result in errors, as the script relies on absolute paths. As a side note, if your code will not be modified, you can also copy the code directly into your Docker image. However, this is usually not the case, as you often want to update your code without rebuilding the Docker image.
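For instance, a minimal Dockerfile sketch along these lines pre-installs dependencies and copies the helper files to the root directory (the base image and requirements file are illustrative):

```dockerfile
# Illustrative base image; pick one that matches your workload
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

# Pre-install all dependencies at build time instead of at runtime
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

# Copy the helper files to the root directory (/) without modification,
# since run.sh relies on these absolute paths
COPY run.sh /run.sh
COPY omnicli /omnicli
```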
Upload your dataset and code to the storage node through FTP.
Note that some FileZilla installers may contain adware. Make sure the name of the installer does not contain the word sponsored.
For FileZilla, enter the ${STORAGE_NODE_IP} from env.sh as the Host, and enter the ${FTP_USER} and ${FTP_PASS} provided by the cluster admin. Also make sure to set Edit > Settings > Transfers > File Types > Default transfer type > Binary to prevent line endings from being changed; see this post for more details.
For lftp, on your local machine run:
```sh
source secrets/env.sh
# Install and set up lftp
sudo apt-get update && sudo apt-get install -y lftp
echo "set ssl:verify-certificate no" >> ~/.lftprc
# Connect to storage node
lftp -u ${FTP_USER},${FTP_PASS} ${STORAGE_NODE_IP}
```
Inside the lftp session, run:
```sh
cd /mnt/nfs
ls
mkdir <YOUR_USERNAME>
cd <YOUR_USERNAME>
# Delete old dataset and code
rm -r data
rm -r mnist
# Upload dataset and code
mirror --reverse examples/data data
mirror --reverse examples/mnist mnist
# Don't close this session just yet, we will need it later
```
When uploading a newer version of your code or dataset, always delete the existing directory first. This ensures that any files removed in the new version are not left behind. If you expect to run a newer version of your code while previous tasks are still running, consider implementing a versioning system by including a version tag in the file path to prevent conflicts.
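For example, a hypothetical version-tagged upload inside the lftp session might look like this (the -v2 suffix is illustrative):

```sh
# Upload to a version-tagged path so running tasks keep using the old copy
mirror --reverse examples/mnist mnist-v2
```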
Create a new environment for your docker image.
Go to Workload manager > Assets > Environments and click + NEW ENVIRONMENT.
Fill in the following fields:
Scope
runai/runai-cluster/<YOUR_LAB>/<YOUR_PROJECT>
Environment name
<YOUR_USERNAME>-pytorch-mnist
Workload architecture & type
Select the type of workload that can use this environment:
Set where the UID, GID, and supplementary groups for the container should be taken from
From the image
In newer versions of Run:ai, the default value may be From the IdP token.
and then click CREATE ENVIRONMENT.
You should create a new environment for each Docker image you want to use. In most cases, you will only need to create one environment. In addition, you can add more tools to the environment, such as TensorBoard, or open custom ports using the Custom tool with the NodePort connection type.
Security settings for later versions of Run:ai (Click to expand)
Create a new GPU workload based on the environment.
Go to Workload manager > Workloads and click + NEW WORKLOAD > Workspace.
Fill in the following fields:
Workspace name
<YOUR_USERNAME>-pytorch-mnist-test1
and click CONTINUE.
Environment
Select the environment for your workload:
<YOUR_USERNAME>-pytorch-mnist
(Optional) Set the connection for your tool(s):
Jupyter Access: Set to Specific user(s)
This optional step is not included in the screenshot below.
Compute resource
Select the node resources needed to run your workload:
gpu-x1
Data sources
Select the data sources your workload needs to access:
<YOUR_LAB>-nfs
General
Set the backoff limit before workload failure:
Attempts: 1
and then click CREATE WORKSPACE.
Make sure not to accidentally select the default jupyter-lab environment. If you do, you'll see a jovyan user instead of root. In that case, recreate the workload with the correct environment <YOUR_USERNAME>-pytorch-mnist.
In our case, we didn't limit the Jupyter access to specific users, so anyone can access the Jupyter Lab.
The /run.sh file mentioned here is the same run.sh script that was copied directly into the Docker image without any modifications during the second step. This pre-written helper script streamlines file downloads and uploads to and from Nucleus while also supporting the sequential execution of multiple commands.
Connect to the Jupyter Lab.
In Workload manager > Workloads, select the workload you just created, click CONNECT > Jupyter, and then click Terminal.
Extract the dataset.
In the Jupyter Lab terminal, run:
```sh
cd /mnt/nfs/<YOUR_USERNAME>/data/MNIST/raw
ls
gzip -dk train-images-idx3-ubyte.gz
gzip -dk train-labels-idx1-ubyte.gz
gzip -dk t10k-images-idx3-ubyte.gz
gzip -dk t10k-labels-idx1-ubyte.gz
ls
```
Although /mnt/nfs is a Network File System (NFS) mounted volume, it typically isn't the bottleneck during training. However, if you notice that your dataloader is causing performance issues, consider copying the dataset to the container's local storage before starting the training process. The NFS volume may also cause issues if you use tar on the mounted volume; make sure to use the --no-same-owner flag to prevent the tar: XXX: Cannot change ownership to uid XXX, gid XXX: Operation not permitted error.
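A quick sketch of both workarounds (all paths are placeholders):

```sh
# Copy the dataset from the NFS mount to the container's local storage
# before training (destination path is a placeholder)
cp -r /mnt/nfs/<YOUR_USERNAME>/data /root/data

# When extracting archives on the NFS mount, skip restoring file ownership
tar --no-same-owner -xf /mnt/nfs/<YOUR_USERNAME>/archive.tar
```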
The apt-get install and pip install commands here are only for demonstration purposes; installing packages at runtime is not recommended, as it can slow down the task and potentially cause issues. Instead, include all dependencies in the Docker image by specifying them in the Dockerfile.
Make sure to store all checkpoints and output files in /mnt/nfs. Otherwise, after the container is terminated, all files outside of /mnt/nfs (including the home directory) will be permanently deleted. This is because containers are ephemeral and only the NFS mount persists between runs.
Download the results.
Inside the previous lftp session, run:
```sh
cd /mnt/nfs/<YOUR_USERNAME>/mnist
cache flush
ls
# Download the results
get mnist_cnn.pt
rm mnist_cnn.pt
```
Make sure to delete the results after downloading to save storage space.
Delete the workload.
Go to Workload manager > Workloads, select the workload you just created, and click DELETE. Always STOP or DELETE the workload after you are done with the task to allow maximum resource utilization.
As an alternative to interactive Jupyter Lab workloads, you may want to submit a batch workload.
Go to Workload manager > Workloads and click + NEW WORKLOAD > Workspace.
Fill in the following fields:
- Workspace name
<YOUR_USERNAME>-pytorch-mnist-test2
and click CONTINUE.
- Environment
- Select the environment for your workload:
<YOUR_USERNAME>-pytorch-mnist
- Set a command and arguments for the container running in the pod (a hypothetical example is shown after this list):
- Command
- Compute resource
- Select the node resources needed to run your workload:
gpu-x1
- Data sources
- Select the data sources your workload needs to access:
<YOUR_LAB>-nfs
- General
- Set the backoff limit before workload failure:
Attempts: 1
and then click CREATE WORKSPACE.
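For reference, one hypothetical way to fill the command field is a single shell invocation that runs the training script from the NFS mount (adjust the path to your setup):

```sh
# Hypothetical command and arguments for the batch workload
bash -c "cd /mnt/nfs/<YOUR_USERNAME>/mnist && python main.py"
```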
Note that the batch workload will automatically restart once if it fails, since we set the backoff limit to 1. There is currently no way to set the backoff limit to 0, so make sure a workload restart will not overwrite your previous results.
After the workload is completed, click SHOW DETAILS to see the logs.
Similar to the interactive workload, you should see the checkpoint and output files at /mnt/nfs/<YOUR_USERNAME>/mnist/mnist_cnn.pt through FTP.
Make sure to always add your username as a prefix to your environment name and workload name. This helps prevent others from accidentally modifying your setup.
For downloading large files or directories, consider using tar with pigz to compress the files in parallel. See tar + pigz and tar + pv + pigz for examples.
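A minimal sketch (pigz must be installed; file and directory names are placeholders):

```sh
# Compress a directory in parallel before downloading
tar -cf - results/ | pigz > results.tar.gz
# Decompress on the local machine
pigz -dc results.tar.gz | tar -xf -
```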
As a side note, you may want to use Wandb to log your training results. This allows you to visualize the training progress of all your workloads in a single dashboard.
Now that you have a basic understanding of the workflow, here are a few tips to help you work more efficiently:
Build and test locally first. Always create your custom Docker image on a local Linux machine and test it there before deploying to Run:ai (see the sketch after this list). This makes debugging easier and prevents wasting GPU resources on Run:ai.
Use persistent storage wisely. Store all code and data in the persistent NFS volume, back them up regularly to your local machine, and remove unnecessary files to save shared storage space on Run:ai. To minimize performance impact, copy the dataset to the container's local storage before starting the training process, and reduce checkpointing frequency.
Prefer batch workloads. When possible, use batch workloads so containers terminate automatically after tasks complete, freeing GPU resources for others.
Use interactive Jupyter Lab only when needed. Reserve interactive workloads for debugging, and always stop or delete them when finished to release the resources. Depending on your cluster policy, idle interactive workloads may be automatically terminated without warning after a set time or during maintenance. Keeping an idle interactive workload running for days is often frowned upon, unless you have contacted the cluster admin and received explicit permission.
Request minimal GPU resources. If you are not sure about the minimum GPU resources required for your task, request minimal resources (gpu-x1) first. You can always request more resources (e.g., gpu-x2, gpu-x4, gpu-x8) later. In addition, don't submit CPU workloads (gpu-x0, cpu-only) to a GPU node pool unless you have contacted the cluster admin and received explicit permission.
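As an example of the first tip above, a local smoke test might look like this (the image name is a placeholder; requires Docker and the NVIDIA Container Toolkit):

```sh
# Build the image locally
docker build -t <YOUR_USERNAME>/pytorch-mnist .
# Verify GPU access inside the container before deploying to Run:ai
docker run --rm --gpus all <YOUR_USERNAME>/pytorch-mnist nvidia-smi
```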
For more sample applications (such as Isaac Sim and Isaac Lab), please refer to the Applications section.