This webpage is directly generated from the README of j3soon/go-nvml-mig-create-instance. Please refer to the repository for additional information such as example Go code.
An unofficial example of creating Multi-Instance GPU (MIG) instances with the NVIDIA Management Library (NVML) Go bindings.
docker run --rm -it --gpus all \
  -v $(pwd):/workspace \
  --cap-add=SYS_ADMIN \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  golang
# in the container
cd /workspace
Note: --runtime=nvidia, -e NVIDIA_VISIBLE_DEVICES=all, and -e NVIDIA_DRIVER_CAPABILITIES=all may be required depending on your environment and use cases.
Alternatively, you can install Go on your host machine and skip this step.
Run the example and observe results:
go run main.go
# List the available CIs and GIs
nvidia-smi mig -lgi; nvidia-smi mig -lci;
# Destroy all the CIs and GIs
nvidia-smi mig -dci; nvidia-smi mig -dgi;
This should also work on A100/H100/H200 by substituting the MIG profile with one supported on that device.
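Before any of the NVML calls discussed below can be made, the library must be initialized and a device handle obtained. A minimal sketch using the go-nvml bindings (assuming the target GPU is at device index 0; this requires NVIDIA hardware and drivers to run):

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// Initialize NVML; all other NVML calls require this.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// Assumes the target GPU is at index 0.
	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}

	name, ret := device.GetName()
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	fmt.Println("Device 0:", name)
}
```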
We can see that it takes a reference to GpuInstanceProfileInfo as its argument. Take a look at its source (ref: cpp, go, src):
/**
* GPU instance profile information.
*/
typedef struct nvmlGpuInstanceProfileInfo_st
{
unsigned int id; //!< Unique profile ID within the device
unsigned int isP2pSupported; //!< Peer-to-Peer support
unsigned int sliceCount; //!< GPU Slice count
unsigned int instanceCount; //!< GPU instance count
unsigned int multiprocessorCount; //!< Streaming Multiprocessor count
unsigned int copyEngineCount; //!< Copy Engine count
unsigned int decoderCount; //!< Decoder Engine count
unsigned int encoderCount; //!< Encoder Engine count
unsigned int jpegCount; //!< JPEG Engine count
unsigned int ofaCount; //!< OFA Engine count
unsigned long long memorySizeMB; //!< Memory size in MBytes
} nvmlGpuInstanceProfileInfo_t;
We suspect that this information isn't meant to be filled in by hand. We should check the source of the GetGpuInstanceProfileInfo API, which retrieves this information (ref: cpp, go, src):
/**
* Get GPU instance profile information
*
* Information provided by this API is immutable throughout the lifetime of a MIG mode.
*
* For Ampere &tm; or newer fully supported devices.
* Supported on Linux only.
*
* @param device The identifier of the target device
* @param profile One of the NVML_GPU_INSTANCE_PROFILE_*
* @param info Returns detailed profile information
*
* @return
* - \ref NVML_SUCCESS Upon success
* - \ref NVML_ERROR_UNINITIALIZED If library has not been successfully initialized
* - \ref NVML_ERROR_INVALID_ARGUMENT If \a device, \a profile or \a info are invalid
* - \ref NVML_ERROR_NOT_SUPPORTED If \a device doesn't support MIG or \a profile isn't supported
* - \ref NVML_ERROR_NO_PERMISSION If user doesn't have permission to perform the operation
*/
nvmlReturn_t DECLDIR nvmlDeviceGetGpuInstanceProfileInfo(nvmlDevice_t device, unsigned int profile,
nvmlGpuInstanceProfileInfo_t *info);
It seems we need to pass one of the NVML_GPU_INSTANCE_PROFILE_* macros as the profile argument. Let's view the source (ref: cpp, go, src):
/**
* GPU instance profiles.
*
* These macros should be passed to \ref nvmlDeviceGetGpuInstanceProfileInfo to retrieve the
* detailed information about a GPU instance such as profile ID, engine counts.
*/
#define NVML_GPU_INSTANCE_PROFILE_1_SLICE 0x0
#define NVML_GPU_INSTANCE_PROFILE_2_SLICE 0x1
#define NVML_GPU_INSTANCE_PROFILE_3_SLICE 0x2
#define NVML_GPU_INSTANCE_PROFILE_4_SLICE 0x3
#define NVML_GPU_INSTANCE_PROFILE_7_SLICE 0x4
#define NVML_GPU_INSTANCE_PROFILE_8_SLICE 0x5
#define NVML_GPU_INSTANCE_PROFILE_6_SLICE 0x6
#define NVML_GPU_INSTANCE_PROFILE_1_SLICE_REV1 0x7
#define NVML_GPU_INSTANCE_PROFILE_2_SLICE_REV1 0x8
#define NVML_GPU_INSTANCE_PROFILE_1_SLICE_REV2 0x9
#define NVML_GPU_INSTANCE_PROFILE_COUNT 0xA
Please note that NVML_GPU_INSTANCE_PROFILE_COUNT is only a trick to obtain the number of profiles; it is not meant to be used as a profile itself.
We can see that our hypothesis is correct based on the comments. We use NVML_GPU_INSTANCE_PROFILE_4_SLICE in our example.
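Putting these two calls together in Go, a sketch of creating a GI might look like the following (error handling simplified; the device is assumed to already have MIG mode enabled, and the profile must be one the device supports):

```go
package main

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// createGpuInstance sketches creating a 4-slice GI on a MIG-enabled device.
func createGpuInstance(device nvml.Device) (nvml.GpuInstance, nvml.Return) {
	// Retrieve the immutable profile info for the 4-slice GI profile.
	giProfileInfo, ret := device.GetGpuInstanceProfileInfo(nvml.GPU_INSTANCE_PROFILE_4_SLICE)
	if ret != nvml.SUCCESS {
		var none nvml.GpuInstance
		return none, ret
	}
	// Create the GPU instance from the retrieved profile info,
	// rather than filling in the struct by hand.
	return device.CreateGpuInstance(&giProfileInfo)
}
```

Note that the profile info struct is passed by reference to CreateGpuInstance, exactly as retrieved, which matches our hypothesis that it is not meant to be constructed manually.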
Similar to the case in creating GIs, we'll need a ComputeInstanceProfileInfo. Let's look at its source (ref: cpp, go, src):
/**
* Compute instance profile information.
*/
typedef struct nvmlComputeInstanceProfileInfo_st
{
unsigned int id; //!< Unique profile ID within the GPU instance
unsigned int sliceCount; //!< GPU Slice count
unsigned int instanceCount; //!< Compute instance count
unsigned int multiprocessorCount; //!< Streaming Multiprocessor count
unsigned int sharedCopyEngineCount; //!< Shared Copy Engine count
unsigned int sharedDecoderCount; //!< Shared Decoder Engine count
unsigned int sharedEncoderCount; //!< Shared Encoder Engine count
unsigned int sharedJpegCount; //!< Shared JPEG Engine count
unsigned int sharedOfaCount; //!< Shared OFA Engine count
} nvmlComputeInstanceProfileInfo_t;
Similarly, let's check the source for GetComputeInstanceProfileInfo API (ref: cpp, go, src):
/**
* Get compute instance profile information.
*
* Information provided by this API is immutable throughout the lifetime of a MIG mode.
*
* For Ampere &tm; or newer fully supported devices.
* Supported on Linux only.
*
* @param gpuInstance The identifier of the target GPU instance
* @param profile One of the NVML_COMPUTE_INSTANCE_PROFILE_*
* @param engProfile One of the NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_*
* @param info Returns detailed profile information
*
* @return
* - \ref NVML_SUCCESS Upon success
* - \ref NVML_ERROR_UNINITIALIZED If library has not been successfully initialized
* - \ref NVML_ERROR_INVALID_ARGUMENT If \a gpuInstance, \a profile, \a engProfile or \a info are invalid
* - \ref NVML_ERROR_NOT_SUPPORTED If \a profile isn't supported
* - \ref NVML_ERROR_NO_PERMISSION If user doesn't have permission to perform the operation
*/
nvmlReturn_t DECLDIR nvmlGpuInstanceGetComputeInstanceProfileInfo(nvmlGpuInstance_t gpuInstance, unsigned int profile,
unsigned int engProfile,
nvmlComputeInstanceProfileInfo_t *info);
We should pass one of the NVML_COMPUTE_INSTANCE_PROFILE_* macros as the first (profile) argument. Let's view the source (ref: cpp, go, src):
/**
* Compute instance profiles.
*
* These macros should be passed to \ref nvmlGpuInstanceGetComputeInstanceProfileInfo to retrieve the
* detailed information about a compute instance such as profile ID, engine counts
*/
#define NVML_COMPUTE_INSTANCE_PROFILE_1_SLICE 0x0
#define NVML_COMPUTE_INSTANCE_PROFILE_2_SLICE 0x1
#define NVML_COMPUTE_INSTANCE_PROFILE_3_SLICE 0x2
#define NVML_COMPUTE_INSTANCE_PROFILE_4_SLICE 0x3
#define NVML_COMPUTE_INSTANCE_PROFILE_7_SLICE 0x4
#define NVML_COMPUTE_INSTANCE_PROFILE_8_SLICE 0x5
#define NVML_COMPUTE_INSTANCE_PROFILE_6_SLICE 0x6
#define NVML_COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1 0x7
#define NVML_COMPUTE_INSTANCE_PROFILE_COUNT 0x8
We use NVML_COMPUTE_INSTANCE_PROFILE_2_SLICE as the first argument in our example. As for the second argument (engProfile), let's also look at the source (ref: cpp, go, src):
#define NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED 0x0 //!< All the engines except multiprocessors would be shared
#define NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_COUNT 0x1
We can only use NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED for the second argument in our example, since it is currently the only engine profile defined.
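Analogous to the GI case, a Go sketch of creating a CI on an existing GI (e.g., one returned by CreateGpuInstance) might look like this, with error handling simplified:

```go
package main

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// createComputeInstance sketches creating a 2-slice CI on an existing GI.
func createComputeInstance(gi nvml.GpuInstance) (nvml.ComputeInstance, nvml.Return) {
	// Query the immutable profile info for the 2-slice CI profile, with all
	// non-SM engines shared (currently the only engine profile defined).
	ciProfileInfo, ret := gi.GetComputeInstanceProfileInfo(
		nvml.COMPUTE_INSTANCE_PROFILE_2_SLICE,
		nvml.COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED)
	if ret != nvml.SUCCESS {
		var none nvml.ComputeInstance
		return none, ret
	}
	// Create the compute instance from the retrieved profile info.
	return gi.CreateComputeInstance(&ciProfileInfo)
}
```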
Although CIs within the same GI can currently only share GPU engines (Copy Engine (CE), NVENC, NVDEC, NVJPEG, Optical Flow Accelerator (OFA), etc.), this struct may be extended in the future to support isolating these engines for each CI within the same GI.