A100 MIG inter-instance communication

li.baol · September 29, 2021, 2:28pm

The MIG documentation (NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation) says that MIG does not allow GPU-to-GPU communication. Does that mean a MIG instance on GPU 1 cannot communicate with a MIG instance on GPU 2? What about within a single GPU: can two MIG instances on the same GPU communicate with each other?

I’m particularly interested in workloads such as distributed DL training. How can MIG instances be used for distributed training?

atetelman · September 29, 2021, 6:18pm

The relevant notes from the documentation are below:

No GPU to GPU P2P (either PCIe or NVLink) is supported
…

GPUDirect RDMA is supported when used from GPU Instances

Essentially, this means that P2P GPUDirect is not supported by any MIG device. That holds whether the two MIG devices are on the same physical GPU or on different GPUs connected via NVLink or NVSwitch.

You could potentially do distributed training using GPUDirect RDMA, but this would require routing all communications out of the GPU and over the network. This will be much less efficient than simply using a larger MIG device or disabling MIG.

Depending on your use case, the best practice might simply be to disable MIG on a subset of the GPUs in your system. Say you have a DGX Station with 4 A100 GPUs: you can keep two GPUs with MIG disabled, reserved for larger distributed training jobs, and run the other two with MIG enabled for smaller training jobs and notebook development.
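As a rough illustration of that split, here is a minimal Python sketch that only builds the `nvidia-smi -i <index> -mig <0|1>` commands for each GPU and prints them for review. The helper name `mig_mode_commands` and the GPU indices 0 to 3 are assumptions for the example; check the actual indices with `nvidia-smi -L`. Changing MIG mode normally requires root and may need a GPU reset, so the sketch deliberately does not execute anything.

```python
from typing import Iterable, List


def mig_mode_commands(disable: Iterable[int], enable: Iterable[int]) -> List[List[str]]:
    """Build (but do not run) one nvidia-smi command per GPU index.

    `disable` lists GPUs to keep as full devices for distributed jobs;
    `enable` lists GPUs to put into MIG mode for smaller workloads.
    """
    disable_ids = sorted(set(disable))
    enable_ids = sorted(set(enable))

    # A GPU cannot be in both groups at once.
    overlap = set(disable_ids) & set(enable_ids)
    if overlap:
        raise ValueError(f"GPU indices listed in both groups: {sorted(overlap)}")

    commands = []
    for index in disable_ids:
        commands.append(["nvidia-smi", "-i", str(index), "-mig", "0"])
    for index in enable_ids:
        commands.append(["nvidia-smi", "-i", str(index), "-mig", "1"])
    return commands


if __name__ == "__main__":
    # Hypothetical 4-GPU layout: GPUs 0-1 stay whole, GPUs 2-3 run MIG.
    for cmd in mig_mode_commands(disable=[0, 1], enable=[2, 3]):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check them before running them by hand (with `sudo`) on the real system.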

In general, MIG is best suited to workloads small enough that they do not need the full resources of a GPU. If you are looking into distributed training, it sounds like your workload may justify, and benefit from, running on a full GPU rather than a MIG device.

If you haven’t already seen it, the mig-parted tool provides a seamless interface for dynamically reconfiguring MIG devices as needed for your current workloads.
