How to balance NVLink

Hi, there are multiple NVLink links between GPUs: 12 per GPU on an A100 DGX and 18 per GPU on an H100 DGX.
In P2P communication, how are these links kept balanced, or how should the workload be split across them to keep them balanced?

  1. Are NVLinks bound to CTAs? For example, CTAs 0 and 1 use NVLink 0, and CTAs 2 and 3 use NVLink 1.
  2. Is the memory address range interleaved to keep the links balanced? For example, addresses 0–64 KB use NVLink 0 and 64–128 KB use NVLink 1?
  3. Some other round-robin strategy?

Maybe these strategies are not exposed to software.

Thanks.

If I recall correctly, for all-to-all transfers from the host, the transfers should be grouped by “distance” x:

for (int x = 0; x < numGpus; x++) {     // x == 0 is each GPU's local copy
    for (int i = 0; i < numGpus; i++) {
        int k = (i + x) % numGpus;      // in each round, every GPU has a distinct peer
        // send from GPU i to GPU k
    }
}
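A minimal sketch of that schedule using the CUDA runtime API, assuming peer access has already been enabled between every GPU pair. The buffer layout, names (srcBuf, dstBuf, streams), and the per-source/per-destination chunking are my own placeholders for illustration, not part of the original reply:

#include <cuda_runtime.h>

// Distance-grouped all-to-all: srcBuf[i] holds numGpus chunks of `bytes` bytes
// (one per destination), dstBuf[k] holds numGpus chunks (one per source).
void allToAll(int numGpus, void** srcBuf, void** dstBuf,
              size_t bytes, cudaStream_t* streams)
{
    for (int x = 1; x < numGpus; x++) {           // skip x == 0 (local copy)
        for (int i = 0; i < numGpus; i++) {
            int k = (i + x) % numGpus;
            // Copy GPU i's chunk for destination k into GPU k's slot for source i.
            // The driver routes the transfer over NVLink/NVSwitch; the programmer
            // does not pick individual links.
            cudaMemcpyPeerAsync((char*)dstBuf[k] + (size_t)i * bytes, k,
                                (const char*)srcBuf[i] + (size_t)k * bytes, i,
                                bytes, streams[i]);
        }
    }
    for (int i = 0; i < numGpus; i++) {
        cudaStreamSynchronize(streams[i]);        // wait for all outgoing copies
    }
}

Because each round pairs every GPU with a distinct peer, no single destination is oversubscribed while others sit idle, which is what balances the link utilization.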

@striker159
I don’t fully understand your reply.

For example, there are 12 NVLink links (through NVSwitch) between GPU 0 and GPU 1 in an A100 DGX. When data are sent from GPU 0 to GPU 1, all 12 links should be used to improve throughput.

My question is how these 12 links work cooperatively and how the workload is split across them.

Thanks.

In the A100 and H100 platforms that use NVLink with NVLink switch arrays, I don’t think anything is specified about how the different links are utilized, nor is there anything you can do as a programmer to use one link or the other.

My suggested approach balances the NVLink utilization between the switch and the individual GPUs when sending data from each GPU to every other GPU using cudaMemcpyAsync.

As Robert_Crovella already confirmed, I don’t think there is more you can do to control utilization.

Hi, I have two H100 NVL GPUs and couldn’t activate NVLink on Ubuntu 22.04. Do you have any suggestions?

Thank you

I don’t know what “couldn’t activate NVLink” means.

I am unable to use both GPUs at the same time. When I run an LLM, “watch nvidia-smi” shows that only a single GPU is active. Also, it uses only 94 GB of VRAM.

Also, I don’t see the VRAM reported as 188 GB.

An H100 NVL shows up in nvidia-smi as two 94 GB H100 GPUs. You can confirm proper NVLink behavior by running CUDA sample codes such as simpleP2P or p2pBandwidthLatencyTest.
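If you would rather check programmatically than run the samples, here is a minimal sketch (my own, not taken from the samples) that queries whether peer access is possible between the two devices:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    // Ask the runtime whether each GPU can directly access the other's memory.
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("GPU0 -> GPU1 peer access possible: %d\n", canAccess01);
    printf("GPU1 -> GPU0 peer access possible: %d\n", canAccess10);
    return 0;
}

Running nvidia-smi topo -m also shows whether the two GPUs are connected by NVLink.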

The reason for that is not related to NVLink. NVLink (and the H100 NVL) does not automatically cause two GPUs to act as one, nor does it automatically combine two GPUs such that the memory appears as a single unified resource. It requires programming methodologies to take advantage of both GPUs, and you would need to investigate how to do this for your software stack. For example, if using pytorch, you could investigate pytorch DDP or various other approaches to use both GPUs in the H100 NVL product.

I won’t be able to help here with configuring your LLM software stack.

Many of the specifications listed on that product page are aggregate specifications, which you can get a hint of by reading footnotes 1 and 3. In this respect, the descriptions are similar to the way the K80 product (which consisted of two GPUs) was treated some years ago.
