Interpretation of "total aggregate bandwidth" for HGX A100

I am testing an HGX 4xA100 system for my company. According to Powerful Server Platform for AI & HPC | NVIDIA HGX A100, the HGX 4xA100 system supports a total aggregate bandwidth of 2.4 TB/s. But I cannot understand how this number is computed; it doesn’t seem to add up given the specs of the NVLink connections.

According to the developer blog “Introducing NVIDIA HGX A100: The Most Powerful Accelerated Server Platform for AI and High Performance Computing”, each A100 GPU has 12 NVLink connections. This would mean the HGX A100 has 4 NVLinks between any two GPUs, and 24 NVLinks in the system total, which is what I observe when I run nvidia-smi topo -m. A100 systems use 3rd-generation NVLink, which provides 25 GB/s of bandwidth per link in each direction.

I can see how this gives rise to 600 GB/s of bandwidth between any given pair of GPUs if we transfer the data bidirectionally over 3 parallel paths (1 direct path through the 4 NVLinks connecting the two GPUs, and 2 indirect paths of 8 NVLinks each that first pass through one of the other GPUs before converging on the destination).

But 4 * 12 * 25 = 1200 GB/s, and this number is already the bidirectional bandwidth, since each edge between two GPUs is double-counted this way.
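
For reference, here is the arithmetic I’m doing, written out as a small Python sketch (the per-link figure is the published NVLink 3 spec, not something I measured):

```python
# Sanity check of my NVLink arithmetic for an HGX 4xA100 (published specs, not measurements).
GPUS = 4
LINKS_PER_GPU = 12             # NVLink 3 links per A100
GBPS_PER_LINK_PER_DIR = 25     # GB/s per link, per direction

total_links = GPUS * LINKS_PER_GPU // 2          # each link joins two GPUs -> 24
gpu_pairs = GPUS * (GPUS - 1) // 2               # 6 pairs in a fully connected graph
links_per_pair = total_links // gpu_pairs        # 4 NVLinks between any two GPUs

direct_pair_bw = links_per_pair * GBPS_PER_LINK_PER_DIR * 2   # 200 GB/s bidirectional, direct path
pair_bw_with_relays = direct_pair_bw * 3                      # + 2 relay paths -> 600 GB/s

# Summing every link in both directions (same as 4 * 12 * 25):
all_links_bidir = total_links * GBPS_PER_LINK_PER_DIR * 2     # 1200 GB/s

print(total_links, links_per_pair, direct_pair_bw, pair_bw_with_relays, all_links_bidir)
# -> 24 4 200 600 1200
```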

When I run the alltoall benchmark in nccl-tests, I get an average bus bandwidth of 220 GB/s. Even if we assume that means the theoretical bus bandwidth was 300 GB/s, this also ends up being consistent with an aggregate system bandwidth of 1200 GB/s, not 2400 GB/s.

Is there something wrong in my interpretation of aggregate bandwidth, or in my understanding of NVLink?

Here is how I read this: the spec states (1) 4 NVLinks, and (2) each NVLink has 600 GB/sec of bidirectional bandwidth. That equates to 2.4 TB/sec of aggregate bandwidth.

In my industry experience these kinds of documents are typically created by the marketing department, trying to dazzle people with big numbers. Bandwidth numbers of any kind are typically derived by multiplying interface width by interface clock speed, times any applicable double/quadruple pumping, so they are theoretical and not achievable in practice.

If you are trying to make technical assessments for a project, this is probably not the data you want to base plans on. It would be best to run specific benchmarks relevant to your use case, and take it from there. It appears you have already started down that path.

Thanks for the reply. I understand that this kind of theoretical estimation is not precise, and that might be why the nccl-tests numbers were 220 GB/s instead of 300 GB/s. There will be more use-case-specific testing.

But I’m doing the estimation here because I will soon finalize an order with a supplier, and needed to make sure my machine performs reasonably close to specs.

The thing with the 4 x 600 calculation is that 600 GB/s already involves two GPUs: it describes a connection. If a 2-GPU system had 600 GB/s of bidirectional interconnect bandwidth, it would make no sense to claim that the system has 1200 GB/s of aggregate bandwidth. The same applies with 4 GPUs. I suppose marketing might like to simplify the computation, but if I haven’t misunderstood the specs and the 2400 GB/s really is a kind of double counting, then in my opinion they are, at the least, providing misinformation to customers.

I do not see any double counting, just simple arithmetic. The links are full duplex, thus 600 GB/sec of bidirectional bandwidth per link. 4 links at 600 GB/sec result in an aggregate bandwidth of 2.4 TB/sec. No trickery there. Now: is that number meaningful in a practically applicable way? I suspect not. But I am not very knowledgeable about such big systems; they are not something I could afford.

It is the middle of the night in the US right now, so my suggestion would be to wait for an authoritative answer from an NVIDIA employee.

Alright, I’ll see what the answer from NVIDIA is. Thanks for helping as a community contributor. Yeah, this kind of system is indeed quite beefy; it’s mainly something for training large deep learning models.

I’ll leave a note here that the arithmetic shouldn’t work that way, though. The HGX is connected as a 4-node fully connected graph, like this:

[NVLink topology diagram: 4 GPUs, each pair connected by a group of 4 parallel NVLinks]

If anything, that would mean 3.6 TB/s (6 edges at 600 GB/s each). But AFAIK each group of 4 parallel lines in the diagram is supposed to be 200 GB/s bidirectional, not 600 GB/s. From my understanding, the 600 GB/s arises only because there are 3 paths connecting any 2 GPUs.

In order to make sense of this number, you have to look at the bandwidth from the perspective of each GPU. For example, suppose we ran a data exchange bidirectionally between all GPU pairs (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4).

If we ran that kind of test, and measured the bandwidth per GPU, for example using a methodology similar to the bandwidthTest utility, each GPU would report a number that is “consistent” with the 600 GB/s peak theoretical number (i.e. it would report a measured bandwidth that is somewhat below that number).

If we add those together, we get the total measured bandwidth across all 4 GPUs. If we add the corresponding peak theoretical numbers together, we get the peak theoretical sum over all 4 GPUs. That is the number reported.

Yes, you can observe that the write bandwidth to the memory on GPU 1 from GPU 3 corresponds to the read bandwidth from the memory on GPU 3 to GPU 1. That is not what is being reported.

The 600 GB/s number represents the (peak theoretical) NVLink aggregate bandwidth (for both read and write) corresponding to all 12 links added together.

Going back to the previous treatment: if we considered just the bidirectional test between 1 and 3, for example, and ran that test alone, the measured memory bandwidth at a single GPU (either 1 or 3), adding read and write together, would be 200 GB/s peak theoretical (i.e. the measured bandwidth would be some number below that, consistent with it).

There is no way, either by running the tests I described or by summing the bidirectional bandwidths of the NVLink connections, to get to a number consistent with 3.6 TB/s. You end up with a measurement or calculation that is consistent with 2.4 TB/s.
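
To make the two ways of counting concrete, here is a small sketch of the arithmetic (peak theoretical numbers from the published per-link spec, not measurements):

```python
# Two ways of summing NVLink bandwidth on an HGX 4xA100 (peak theoretical, published per-link spec).
GPUS = 4
LINKS_PER_GPU = 12
GBPS_PER_LINK_PER_DIR = 25

# Per-GPU view: each GPU's 12 links, read and write (both directions) added together.
per_gpu_bw = LINKS_PER_GPU * GBPS_PER_LINK_PER_DIR * 2   # 600 GB/s per GPU
per_gpu_sum = GPUS * per_gpu_bw                          # 2400 GB/s -> the spec number

# Per-edge view: 6 GPU pairs, 4 links per pair, each link counted once (both directions).
gpu_pairs = GPUS * (GPUS - 1) // 2
per_edge_bw = 4 * GBPS_PER_LINK_PER_DIR * 2              # 200 GB/s bidirectional per pair
per_edge_sum = gpu_pairs * per_edge_bw                   # 1200 GB/s

# The per-GPU sum counts each link's traffic at both endpoints (TX on one GPU, RX on the other).
assert per_gpu_sum == 2 * per_edge_sum
print(per_gpu_sum, per_edge_sum)                         # -> 2400 1200
```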

I see: the bandwidth is reported as a sum of read and write traffic across the 4 GPU nodes, not as a sum of traffic across the 6 edges between GPUs. That would indeed count the traffic across an edge twice, since it would appear in the write traffic of the origin and the read traffic of the destination. If so, the architecture and specs make sense; thanks for the clarification.

I notice that bandwidthTest only measures within a single GPU, and in that case there would be both read and write traffic to and from the same DRAM, yes. I guess the 2.4 TB/s number kind of makes sense if the whole HGX A100 is treated as a single device that reads from and writes to a single block of memory.

(Though I’m guessing that for inter-GPU communication, all writes must correspond to an equal amount of reads, unlike a single DRAM, where some workloads can create purely read traffic or purely write traffic?)

Another form of inter-GPU traffic is direct access from a remote SM. When the memory of one GPU is peer-mapped into another GPU’s address space, code running on that other GPU can directly access the remote memory. In this case, read and write traffic originating on an SM of GPU A can result in memory bandwidth utilization on GPU B (as well as bandwidth utilization on the NVLink connecting those 2 GPUs) without any corresponding read/write traffic to GPU A’s memory.

I’m working on estimating the time it takes for LLM inference during the autoregressive part of token generation. I’m looking at scenarios with different LLMs and need to calculate the generation time using the formula:

Time per token = Total number of bytes moved (the model weights) / Memory bandwidth
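
For concreteness, here is how I’m applying that formula right now, with illustrative numbers (the model size and bandwidth below are placeholders, not measurements):

```python
# Rough time-per-token estimate for the autoregressive decode phase (illustrative numbers only).
weight_bytes = 70e9 * 2     # e.g. a 70B-parameter model with 16-bit weights -> 140 GB
bandwidth = 2.0e12          # placeholder memory bandwidth in bytes/s (2 TB/s);
                            # whether to use the HGX aggregate number here is exactly my question

time_per_token = weight_bytes / bandwidth
print(f"{time_per_token * 1e3:.1f} ms per token")   # -> 70.0 ms with these placeholder numbers
```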

In this context and if I use an HGX system, should I use the total aggregate bandwidth of the HGX system for my calculations?

I appreciate your help!

I think the reality of multi-GPU LLM inference is unfortunately quite complicated. Benchmarking with actual hardware and test suites is of course the best way forward, but I can give a general idea.
First, there are many ways to do multi-GPU inference; different libraries use different strategies.
If you have enough VRAM per GPU, you can use data parallelism. Data parallelism requires the LLM to fit comfortably inside one GPU, and when used for inference there is no communication needed between GPUs at all, so throughput scales well with GPU count. Your throughput will be the number of GPUs times the throughput per GPU; the latency per response will be the same as with 1 GPU.
If you need to shard a single LLM across multiple GPUs, you can use tensor parallelism. Tensor parallelism doesn’t require model weights to be communicated during inference, only the model activations. It can improve latency per response, but you will probably not get linear scaling of the throughput, i.e. using 2 GPUs will result in less than a 2x throughput increase vs. using the same batch size on 1 GPU. The amount of data communicated is rather small in this strategy (see the sketch below), so the execution time is dominated by computation costs and overheads.
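As a rough illustration of why the communicated data is small here, assuming a Megatron-style split with roughly two all-reduces of the activations per transformer layer (the model dimensions below are just an example):

```python
# Rough per-token communication volume for tensor parallelism during decode.
# Assumes a Megatron-style split with ~2 all-reduces of the activations per transformer layer;
# ignores the all-reduce algorithm's extra factor of ~2(n-1)/n, so this is order-of-magnitude only.
hidden_size = 8192      # example hidden dimension (roughly a 70B-class model)
num_layers = 80         # example layer count
batch_size = 1
bytes_per_value = 2     # 16-bit activations

activation_bytes = batch_size * 1 * hidden_size * bytes_per_value   # sequence length 1 per decode step
comm_per_token = 2 * num_layers * activation_bytes                  # ~2 all-reduces per layer

print(f"{comm_per_token / 1e6:.1f} MB of activations communicated per generated token")  # -> ~2.6 MB
```
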
The only parallelization strategy I can think of that actually moves model weights between GPUs during inference would be something like ZeRO. This basically implements data parallelism while also sharding the weights, by gathering the weights of a layer right before using them. I don’t think it’s very good for inference, but if you are using this kind of parallelization and your model is very big, then the communication latency would be significant. If you have N GPUs, then on each forward pass each GPU would read (N - 1) / N of the model weights over the interconnect, and each read corresponds to an equal number of writes. So you can estimate the communication latency per forward pass as (N - 1) * 2 * Size of Model Weights / Total Aggregate Bandwidth.
For example, using Llama-2 70B with 16-bit weights, on 4 GPUs, with a total aggregate bandwidth of 2400 GB/s, this would be 0.35 s of communication latency per forward pass. The actual time per forward pass is likely significantly higher. With batching, the throughput will be batch size / time per forward pass.
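
Here is that estimate as a small sketch, assuming ZeRO-3-style gathering of the parameters on every forward pass and treating the 2.4 TB/s figure as the read-plus-write aggregate discussed earlier:

```python
# ZeRO-style sharded inference: rough communication latency per forward pass.
n_gpus = 4
weight_bytes = 70e9 * 2          # Llama-2 70B with 16-bit weights -> 140 GB
aggregate_bw = 2.4e12            # HGX 4xA100 "total aggregate bandwidth" (reads + writes both counted)

# Each GPU gathers (n-1)/n of the weights; every byte moved counts once as a read and once as a
# write against the aggregate number, hence the factor of 2.
comm_time = (n_gpus - 1) * 2 * weight_bytes / aggregate_bw
print(f"{comm_time:.2f} s of communication per forward pass")    # -> 0.35 s

batch_size = 8                                                    # example batch size
tokens_per_s = batch_size / comm_time                             # throughput if communication dominated
print(f"~{tokens_per_s:.0f} tokens/s upper bound from communication alone")
```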