Getting the Most Out of the NVIDIA A100 GPU with Multi-Instance GPU

Originally published at: https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/

With third-generation Tensor Core technology, NVIDIA recently unveiled the A100 Tensor Core GPU, which delivers unprecedented acceleration at every scale for AI, data analytics, and high-performance computing. Along with the large performance increase over prior-generation GPUs comes another groundbreaking innovation, Multi-Instance GPU (MIG). With MIG, each A100 GPU can be partitioned into up to seven…

I have followed all the instructions referred to in the MIG Manual; however, when I run “sudo nvidia-smi mig -cgi 9,3g.20gb -C”, it returns:
Option “-C” is not recognized.
How should I solve this problem?
And without the “-C” option, although I can list the GPU instances with “nvidia-smi mig -lgi”, I cannot see them through either “nvidia-smi” or “ls -l /proc/driver/nvidia/capabilities/gpu1/mig/gi*”.
What should I do about this?

Hi ryy19

Option “-C” is not recognized.

As mentioned in the software prerequisites, are you running at least R450.80.02 as the driver version for the A100? The “-C” option is only available starting with that driver version.
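If the driver is new enough, the command from your question should work as-is. A minimal sketch for checking the driver and creating the instances:

# Check the installed driver version (should be at least R450.80.02 for the -C option)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Create two 3g.20gb GPU instances (one specified by profile ID 9, one by name)
# and their default compute instances in one step via -C
sudo nvidia-smi mig -cgi 9,3g.20gb -C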

I cannot see them through either “nvidia-smi” or “ls -l /proc/driver/nvidia/capabilities/gpu1/mig/gi*”

Can you please provide more information on what you’re not able to see? MIG devices, once created, can be accessed either through “nvidia-smi -L” or “nvidia-smi mig -lgi”.
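For example, a quick sketch of the listing commands mentioned above:

# List GPUs and any MIG devices beneath them
nvidia-smi -L
# List the GPU instances and the compute instances created on them
nvidia-smi mig -lgi
nvidia-smi mig -lci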

Hi, is there a good way for users without sudo rights to use the MIG functionality? I think running multiple scripts in parallel on the same A100 sounds very interesting, but it needs to work without admin rights (at least after the admin has enabled MIG on the GPU).

Is there a way to do that?

Hi @mikkelsen.kaare - not today. We expect that clusters with A100 GPUs are configured in desired MIG geometries - the configurations can be static (a priori by the infra team) or dynamic (using a systemd service for example as nodes are brought online when used in an autoscaler environment). We have created tooling that can be used for these purposes.

Please check this project for a declarative way to create the desired MIG geometries: https://github.com/nvidia/mig-parted and the associated systemd service that can be used in conjunction with provisioning nodes: https://github.com/NVIDIA/mig-parted/tree/master/deployments/systemd. We expect these tools to be used instead of raw nvidia-smi commands, which can be error-prone in a production environment. Hope these are useful.
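As a rough sketch of the declarative approach (the config name and profile below are illustrative; check the mig-parted repository’s examples for the exact schema and CLI flags):

# Hypothetical config: split every GPU into two 3g.20gb instances
cat > mig-config.yaml <<'EOF'
version: v1
mig-configs:
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF
# Apply the named configuration
sudo nvidia-mig-parted apply -f mig-config.yaml -c all-3g.20gb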


Hi,
Does the A100 GPU with Multi-Instance GPU (MIG) allow users to set the application clocks (graphics or memory) for a specific GPU instance? Or, when we set the application clocks via nvidia-smi, do they apply to all instances within the GPU?

Hi @kz181, it should apply to all MIG instances, as all MIG instances share a single clock and power limit.
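For reference, a minimal sketch of setting application clocks GPU-wide with nvidia-smi (the clock values below are illustrative; query the supported values on your system first):

# Query the supported memory/graphics clock pairs (values differ per GPU and SKU)
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS
# Set application clocks as "<memory clock>,<graphics clock>" in MHz.
# This applies to GPU 0 as a whole, i.e. to every MIG instance on it.
sudo nvidia-smi -i 0 -ac 1215,1410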

Is it possible to enumerate multiple MIG compute instances?

For example, can I pass the UUIDs of multiple MIG compute instances via CUDA_VISIBLE_DEVICES or to --gpus for Docker, so that my program or Docker container can find those MIG devices and use cudaSetDevice to index them by number, such as 0, 1, 2 for three different compute instances?

Thanks!
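For readers wondering about the syntax: a rough sketch of what such an invocation might look like. The identifiers, script, and image names are placeholders (substitute the values reported by “nvidia-smi -L”; the exact MIG device-name format depends on the driver version), and, as noted in the follow-up below, how many MIG devices a single process actually ends up using is a separate matter.

# Placeholder MIG identifiers taken from "nvidia-smi -L"
export CUDA_VISIBLE_DEVICES=MIG-GPU-<gpu-uuid>/1/0,MIG-GPU-<gpu-uuid>/2/0
python my_app.py
# With Docker and the NVIDIA container toolkit, MIG devices can also be requested
# as <GPU index>:<MIG index> (exact syntax depends on the runtime version)
docker run --gpus '"device=0:0,0:1"' my_image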

Under “single” strategy, “num_gpus” doesn’t work. It always uses one MIG device.
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=64 --model=resnet50 --use_fp16

Hi,
When multiple users log into the same A100 machine (over SSH on Linux), how do we allocate the MIG devices to each user so that one user does not step on another user’s GPU slices? Let’s say we use 3g.20gb, i.e., each A100 GPU is split into 2 slices, so 16 slices are available in total. There are device IDs and UUIDs now. Is there a way to allocate devices to individual users? @maggiez @chetantekur

Is MIG meant only for Docker containers? Can multiple users SSH directly to the VM and use it?

Take a look at CUDA_VISIBLE_DEVICES; I am not sure whether this could help you.
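As an illustrative sketch (the identifier is a placeholder, and CUDA_VISIBLE_DEVICES is an environment-variable convention honored by CUDA applications rather than enforced isolation), each user could pin their shell to one MIG slice, for example in their ~/.bashrc:

# List MIG devices and their enumeration names / UUIDs
nvidia-smi -L
# Pin this user's CUDA work to a single MIG slice (placeholder identifier)
export CUDA_VISIBLE_DEVICES=MIG-GPU-<gpu-uuid>/1/0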
