Interpretation of "total aggregate bandwidth" for HGX A100

I am testing an HGX 4xA100 system for my company. According to “Powerful Server Platform for AI & HPC | NVIDIA HGX A100”, the HGX 4xA100 system supports a total aggregate bandwidth of 2.4 TB/s. But I cannot understand how this number is computed; it doesn’t seem consistent with the specs of the NVLink connections.

According to the developer blog “Introducing NVIDIA HGX A100: The Most Powerful Accelerated Server Platform for AI and High Performance Computing”, each A100 GPU has 12 NVLink connections. This would mean the HGX A100 has 4 NVLinks between any two GPUs, and 24 NVLinks in the system in total, which is what I observe when I run nvidia-smi topo -m. A100 systems use 3rd-generation NVLink, which supports 25 GB/s of bandwidth per link in each direction.
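For reference, here is a quick way to sanity-check the pairwise peer connectivity from the CUDA runtime (just a sketch of my own; it only reports whether peer access is possible between each pair, not the link type or count, which is what nvidia-smi topo -m shows):

```cpp
// Sketch: enumerate GPUs and check which pairs report peer (P2P) access.
// Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n", i, j,
                   canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```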

I can see how this gives rise to 600 GB/s of bandwidth between any given pair of GPUs if we transfer the data bidirectionally over 3 parallel paths (1 direct path through the 4 NVLinks connecting the two GPUs, and 2 indirect paths of 8 NVLinks each that first pass through the other GPUs before converging on the destination).

But 4 × 12 × 25 = 1200 GB/s, and this number is already the bidirectional bandwidth, since each edge between two GPUs is double-counted this way.
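To spell the arithmetic out (my own back-of-the-envelope accounting, assuming 25 GB/s per link per direction as above):

```cpp
// Back-of-the-envelope accounting for the numbers above.
#include <cstdio>

int main() {
    const int    gpus            = 4;
    const int    links_per_gpu   = 12;   // per the developer blog
    const int    links_per_pair  = 4;    // 12 links spread over 3 peers
    const double gb_per_link_dir = 25.0; // GB/s per NVLink, one direction

    // One GPU pair: 4 direct links, both directions...
    double pair_direct = links_per_pair * gb_per_link_dir * 2;           // 200 GB/s
    // ...plus the two 2-hop paths through the other two GPUs.
    double pair_total  = 3 * pair_direct;                                // 600 GB/s
    // Summing over all 24 links in the system, both directions:
    double link_sum = (gpus * links_per_gpu / 2) * gb_per_link_dir * 2;  // 1200 GB/s

    printf("per pair (3 paths): %.0f GB/s\n", pair_total);
    printf("sum over 24 links : %.0f GB/s\n", link_sum);
    return 0;
}
```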

When I run the alltoall benchmark in nccl-tests, I get an average bus bandwidth of 220 GB/s. Even if we assume that corresponds to a theoretical bus bandwidth of 300 GB/s, it is still consistent with an aggregate system bandwidth of 1200 GB/s, not 2400 GB/s.

Is there something wrong in my interpretation of aggregate bandwidth, or in my understanding of NVLink?

Here is how I read this: the spec states (1) 4 NVLinks, and (2) each NVLink has 600 GB/sec of bidirectional bandwidth. That equates to 2.4 TB/sec of aggregate bandwidth.

In my industry experience, these kinds of documents are typically created by the marketing department, trying to dazzle people with big numbers. Bandwidth numbers of any kind are typically derived by multiplying interface width by interface clock speed, times any applicable double/quadruple-pumping factor, so they are theoretical and not achievable in practice.

If you are trying to make technical assessments for a project, this is probably not the data you want to base plans on. It would be best to run specific benchmarks relevant to your use case, and take it from there. It appears you have already started down that path.
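For example, a rough pairwise peer-copy test along these lines (only a sketch of the idea; the p2pBandwidthLatencyTest sample shipped with the CUDA samples does this more thoroughly) would give you per-pair numbers to compare against the datasheet figures:

```cpp
// Sketch: time repeated peer-to-peer copies from GPU 0 to GPU 1 and report
// the one-directional bandwidth. Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;  // 1 GiB per transfer
    const int src = 0, dst = 1, iters = 20;

    void *bufSrc = nullptr, *bufDst = nullptr;
    cudaSetDevice(dst);
    cudaMalloc(&bufDst, bytes);
    cudaSetDevice(src);
    cudaMalloc(&bufSrc, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);  // request a direct P2P path

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeerAsync(bufDst, dst, bufSrc, src, bytes, 0);  // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(bufDst, dst, bufSrc, src, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms / 1e3) / 1e9;
    printf("GPU %d -> GPU %d : %.1f GB/s (one direction)\n", src, dst, gbps);
    return 0;
}
```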

Thanks for the reply. I understand that this kind of theoretical estimation is not precise, and that might be why the nccl-tests numbers were 220 GB/s instead of 300 GB/s. There will be more use-case-specific testing.

But I’m doing the estimation here because I will soon finalize an order with a supplier, and I need to make sure the machine performs reasonably close to its specs.

The thing with the 4 × 600 calculation is that 600 GB/s already involves two GPUs: it describes a connection. If a 2-GPU system had 600 GB/s of bidirectional interconnect bandwidth, it would make no sense to claim that the system has 1200 GB/s of aggregate bandwidth. The same thing applies with 4 GPUs. I suppose marketing might like to simplify the computations, but if I didn’t misunderstand the specs and the 2400 GB/s really is a kind of double-counting, then in my opinion they are at the very least providing misinformation to customers.

I do not see any double counting, just simple arithmetic. The links are full duplex, thus 600 GB/sec of bidirectional bandwidth per link. 4 links at 600 GB/sec result in an aggregate bandwidth of 2.4 TB/sec. No trickery there. Now: is that number meaningful in a practically applicable way? I suspect not. But I am not very knowledgeable about such big systems; they are not something I could afford.

It is the middle of the night in the US right now, so my suggestion would be to wait for an authoritative answer from an NVIDIA employee.

Alright, I’ll see what the answer from NVIDIA is. Thanks for helping as a community contributor. Yeah, this kind of system is indeed quite beefy; it’s mainly used for training large deep learning models.

I’ll leave a note here that the arithmetic shouldn’t work that way, though. The HGX is connected as a 4-node fully connected graph, like this:

[topology diagram: four GPUs, with each pair of GPUs joined by a group of 4 NVLinks]

If anything, that would mean 3.6 TB/s (6 pairwise connections × 600 GB/s). But AFAIK each group of 4 parallel links in the diagram is supposed to be 200 GB/s bidirectional (4 × 25 GB/s × 2 directions), not 600 GB/s. From my understanding, the 600 GB/s arises only because there are 3 paths connecting any 2 GPUs.

In order to make sense of this number, you have to look at the bandwidth from the perspective of each GPU. For example, suppose we ran a data exchange bidirectionally between all GPU pairs (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4).

If we ran that kind of test and measured the bandwidth per GPU, for example using a methodology similar to the bandwidthTest utility, each GPU would report a number that is “consistent” with the 600 GB/s peak theoretical number (i.e., it would report a measured bandwidth somewhat below that number).

If we add those together, we get the total bandwidth measured by all 4 GPUs. If we add the corresponding peak theoretical numbers together, we get the peak theoretical sum across all 4 GPUs. That is the number being reported.

Yes, you can observe that the write bandwidth to the memory on GPU 1 from GPU 3 corresponds to the read bandwidth from the memory on GPU 3 to GPU 1. That is not what is being reported.

The 600 GB/s number represents the (peak theoretical) NVLink aggregate bandwidth of a single GPU (for both read and write), corresponding to all 12 of its links added together.

In our previous treatment, if we just considered the bidirectional test between 1 and 3, for example, and ran that test alone, the measured memory bandwidth at a single GPU (either 1 or 3), adding read and write together, would be 200 GB/s peak theoretical (i.e., the measured bandwidth would be some number below that, consistent with it).

There is no way, either by running the tests I described or by summing the bidirectional bandwidths of the NVLink connections, to get to a number consistent with 3.6 TB/s. You end up with a measurement or calculation that is consistent with 2.4 TB/s.
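To put that accounting into numbers (a summary of the arithmetic described above):

```cpp
// Summary of the per-GPU accounting described above.
#include <cstdio>

int main() {
    const double gb_per_link_dir = 25.0;  // GB/s per NVLink, one direction

    double per_gpu  = 12 * gb_per_link_dir * 2; // 600 GB/s: all 12 links, read + write
    double per_pair = 4  * gb_per_link_dir * 2; // 200 GB/s seen at one GPU in an isolated pair test
    double system   = 4  * per_gpu;             // 2400 GB/s: summed over the 4 GPUs

    printf("per GPU : %.0f GB/s\n", per_gpu);
    printf("per pair: %.0f GB/s\n", per_pair);
    printf("system  : %.0f GB/s\n", system);
    return 0;
}
```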

I see: the bandwidth is reported as a sum of read and write traffic across the 4 GPU nodes, not as a sum of traffic across the 6 edges between GPUs. That would indeed count the traffic across an edge twice, since it would appear in the write traffic of the origin and the read traffic of the destination. If so, the architecture and specs make sense; thanks for the clarification.

I notice that bandwidthTest only measures within a single GPU, and in that case there would be both read and write traffic to and from the same DRAM, yes. I guess the 2.4 TB/s number kind of makes sense if the whole HGX A100 is treated as a single device that reads from and writes to a single block of memory.

(Though I’m guessing that for inter-GPU communication, all writes must be matched by an equal amount of reads, unlike a single DRAM, where some workloads can create purely read traffic or purely write traffic?)

Another form of inter-GPU traffic is direct access from a remote SM. When the memory of one GPU is peer-mapped into another GPU, code running on that second GPU can directly access the remote memory. In this case, read and write traffic originating on an SM of GPU A can result in memory bandwidth utilization on GPU B (as well as bandwidth utilization on the NVLink connection between those 2 GPUs) without any corresponding memory read/write traffic to GPU A's memory.
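A minimal sketch of that pattern (assuming peer access can be enabled between the two devices; error checking omitted):

```cpp
// Sketch: a kernel launched on GPU 0 reads and writes memory that physically
// resides on GPU 1. The read and write traffic shows up on GPU 1's memory and
// on the NVLinks between the two GPUs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *remote, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] *= f;  // load and store go to the peer GPU's memory
}

int main() {
    const int n = 1 << 20;

    // Allocate and initialize on GPU 1.
    float *buf = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    // Map GPU 1's allocation into GPU 0's address space and launch there.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);
    cudaDeviceSynchronize();

    printf("done\n");
    return 0;
}
```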