A100 MIG inter-instance communication

According to the MIG documentation (NVIDIA Multi-Instance GPU User Guide), MIG does not allow GPU-to-GPU communication. Does this mean that a MIG instance on GPU 1 cannot communicate with a MIG instance on GPU 2? What about within a single GPU: can two MIG instances on the same GPU communicate with each other?

I’m particularly interested in workloads such as distributed DL training. How can MIG instances be used for distributed training?

The relevant notes from the documentation are below:

  • No GPU to GPU P2P (either PCIe or NVLink) is supported
  • GPUDirect RDMA is supported when used from GPU Instances

Essentially, this means that GPUDirect P2P is not supported between MIG devices. This applies whether the two MIG devices are on the same physical GPU or on different GPUs connected via NVLink or NVSwitch.

You could potentially do distributed training using GPUDirect RDMA, but this would require routing all communications out of the GPU and over the network. This will be much less efficient than simply using a larger MIG device or disabling MIG.

Depending on your use case, the best practice might be to disable MIG on just a subset of the GPUs in your system. For example, on a DGX Station with 4 A100 GPUs, you could keep two GPUs with MIG disabled, reserved for larger distributed training jobs, and MIG-enable the other two for smaller training jobs and notebook development.
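
As a rough sketch of that kind of split, the Python snippet below builds (but does not run) one `nvidia-smi -i <index> -mig <0|1>` command per GPU. The helper name `mig_mode_commands` and the 4-GPU layout are hypothetical, chosen only to mirror the DGX Station example. Actually running the printed commands needs root privileges, and a GPU reset may be required before a mode change takes effect.

```python
from typing import Dict, List


def mig_mode_commands(plan: Dict[int, bool]) -> List[List[str]]:
    """Return one nvidia-smi argv per GPU index.

    plan maps a GPU index to True (enable MIG) or False (disable MIG).
    """
    commands = []
    for gpu_index in sorted(plan):
        flag = "1" if plan[gpu_index] else "0"
        commands.append(["nvidia-smi", "-i", str(gpu_index), "-mig", flag])
    return commands


# Hypothetical layout: GPUs 0-1 stay whole for distributed training,
# GPUs 2-3 are MIG-enabled for smaller jobs and notebooks.
EXAMPLE_PLAN = {0: False, 1: False, 2: True, 3: True}

if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in mig_mode_commands(EXAMPLE_PLAN):
        print("sudo " + " ".join(argv))
```

With the two full GPUs left untouched, a distributed job can then be restricted to them in the usual way (for example via `CUDA_VISIBLE_DEVICES=0,1`), where P2P over NVLink remains available.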

In general, MIG is best suited to workloads small enough that they do not need the full resources of a GPU. If you are looking into distributed training, it sounds like your workload may justify, and benefit from, running on a full GPU rather than a MIG instance.

If you haven’t already seen it, the mig-parted tool provides a convenient interface for dynamically reconfiguring MIG devices as needed for your current workloads.
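
To give a feel for what a mig-parted configuration looks like, here is a minimal Python sketch that renders a config file in the YAML layout used by the examples in the mig-parted README. The function name `render_mig_parted_config`, the config name `half-and-half`, and the choice of seven `1g.5gb` slices (a profile available on 40 GB A100s) are assumptions for illustration; check the profiles your GPUs actually report before using anything like this.

```python
from typing import Any, Dict, List


def render_mig_parted_config(name: str, groups: List[Dict[str, Any]]) -> str:
    """Render a single named mig-parted config as YAML text.

    Each group has "devices" (list of GPU indices) and, optionally,
    "mig_devices" (profile name -> instance count). A group without
    "mig_devices" is rendered with MIG disabled.
    """
    lines = ["version: v1", "mig-configs:", f"  {name}:"]
    for group in groups:
        devices = ", ".join(str(i) for i in group["devices"])
        profiles = group.get("mig_devices") or {}
        lines.append(f"    - devices: [{devices}]")
        lines.append(f"      mig-enabled: {'true' if profiles else 'false'}")
        if profiles:
            lines.append("      mig-devices:")
            for profile, count in profiles.items():
                lines.append(f'        "{profile}": {count}')
    return "\n".join(lines) + "\n"


# Hypothetical split: GPUs 0-1 whole, GPUs 2-3 carved into 1g.5gb slices.
EXAMPLE_GROUPS = [
    {"devices": [0, 1]},
    {"devices": [2, 3], "mig_devices": {"1g.5gb": 7}},
]

if __name__ == "__main__":
    print(render_mig_parted_config("half-and-half", EXAMPLE_GROUPS), end="")
```

Saving the output to a file and running `nvidia-mig-parted apply -f <file> -c half-and-half` should then apply that layout, assuming the tool is installed and the profile names match your hardware.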
