I am testing an HGX 4xA100 system for my company. According to the NVIDIA product page “Powerful Server Platform for AI & HPC | NVIDIA HGX A100”, the HGX 4xA100 system supports a total aggregate bandwidth of 2.4 TB/s. But I cannot understand how this number is computed; it doesn’t seem consistent with the specs of the NVLink connections.

According to the developer blog “Introducing NVIDIA HGX A100: The Most Powerful Accelerated Server Platform for AI and High Performance Computing”, each A100 GPU has 12 NVLink connections. This would mean the HGX A100 has 4 NVLinks between any two GPUs, for 24 NVLinks in the system total, which matches what I observe when I run `nvidia-smi topo -m`. A100 systems use 3rd generation NVLink, which supports 25 GB/s of single-direction bandwidth per link.
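To make the topology arithmetic explicit, here is a small Python sketch; the constants are the spec numbers quoted above, not measurements:

```python
n_gpus = 4
links_per_gpu = 12                                # NVLink 3 links per A100, per the blog post

# 12 links split evenly among the 3 peer GPUs
links_per_pair = links_per_gpu // (n_gpus - 1)

# every physical NVLink has two GPU endpoints, so divide the sum by 2
total_links = n_gpus * links_per_gpu // 2

print(links_per_pair, total_links)  # 4 24
```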

I can see how this gives rise to 600 GB/s of bandwidth between any given pair of GPUs if we transfer data bidirectionally along 3 parallel paths (1 direct path through the 4 NVLinks connecting the pair, and 2 indirect paths of 8 NVLinks each that route through one of the other GPUs before converging on the destination).
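Spelling out that pairwise calculation (again just spec numbers, assuming each relayed path is limited by its 4-link hops):

```python
link_bw = 25                              # GB/s per direction, NVLink 3
links_per_pair = 4

# bidirectional capacity of any single 4-link hop: 4 * 25 * 2 = 200 GB/s
path_bw = 2 * links_per_pair * link_bw

# 1 direct path + 2 relayed paths, each hop-limited to 200 GB/s
pair_bw = 3 * path_bw

print(pair_bw)  # 600
```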

But 4 * 12 * 25 = 1200 GB/s, and this number is already the bidirectional bandwidth, since each link between two GPUs is double-counted this way.
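The double-counting claim can be checked by computing the same total two ways, per GPU and per physical link:

```python
# sum of per-direction link bandwidth over all 4 GPUs
# (each physical link is counted once at each of its two endpoints)
per_gpu_sum = 4 * 12 * 25

# equivalently: 24 physical links, each 50 GB/s bidirectional
per_link_sum = 24 * (2 * 25)

print(per_gpu_sum, per_link_sum)  # 1200 1200
```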

When I run the alltoall benchmark in nccl-tests, I get an average bus bandwidth of 220 GB/s. Even if we assume the theoretical bus bandwidth is 300 GB/s, that is still consistent with an aggregate system bandwidth of 1200 GB/s, not 2400 GB/s.
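For context, here is how I relate the per-GPU bus bandwidth to the aggregate number; this is my own sanity check under the link assumptions above, not NCCL's formula:

```python
link_bw = 25  # GB/s per direction per NVLink

# in alltoall, each GPU sends to its 3 peers over 4 links each:
# 3 * 4 * 25 = 300 GB/s of egress per GPU
per_gpu_egress = 3 * 4 * link_bw

# with all 4 GPUs sending (and receiving) at once, every one of the
# 48 link-directions is saturated: 48 * 25 = 1200 GB/s aggregate
aggregate = 4 * per_gpu_egress

print(per_gpu_egress, aggregate)  # 300 1200
```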

Is there something wrong in my interpretation of aggregate bandwidth, or in my understanding of NVLink?