Hello everyone,

I’m working on estimating the time it takes for LLM inference during the autoregressive part of token generation. I’m looking at scenarios with different LLMs (70B+ parameters, FP16 precision) and need to calculate the generation time using the formula:

Time per token = Total number of bytes moved (the model weights) / Memory bandwidth
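For concreteness, here is the kind of estimate I mean (the bandwidth figure below is just a placeholder, not a claim about any specific GPU):

```python
# Rough time-per-token estimate for the autoregressive (decode) phase.
# Assumption: every weight is read from device memory once per generated token.
n_params = 70e9          # 70B-parameter model
bytes_per_param = 2      # FP16
bandwidth = 3.35e12      # placeholder memory bandwidth, bytes/s (check your GPU's datasheet)

weight_bytes = n_params * bytes_per_param
time_per_token = weight_bytes / bandwidth        # seconds per generated token
tokens_per_second = 1 / time_per_token

print(f"{time_per_token * 1e3:.1f} ms/token, {tokens_per_second:.1f} tokens/s")
```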

In this context, if I use an HGX system, should I use the total aggregate bandwidth of the HGX system for my calculations?

I appreciate your help.

I think the bandwidth being referred to here is the *memory* bandwidth. In that case, I would add up the device memory bandwidths for the number of GPUs used to run the 70B model inference. When you say “total aggregate bandwidth of the HGX system” I’m not really sure what you are referring to exactly. If you are referring to the sum of all the device memory bandwidths and are inferencing using all the GPUs in the HGX, then we are saying the same thing. If you are referring to NVLink bandwidth, or are doing inferencing on some subset of the GPUs, then we are not saying the same thing.

Aside:

Your formula is a basic memory-bound calculation/estimate, not really unique or specific to LLMs, but it is perhaps somewhat “sensible” because, to do a single forward inference pass on an LLM, AFAIK you will need to read each parameter once from device main memory.

You would want to be sure a memory-bound calculation is the best approach here. Certainly with batching, generation may in some cases become compute bound rather than memory bound, though it may be that the “autoregressive part of token generation” you are referring to doesn’t represent that case. Alternatively, with KV caching, the process may become memory bound again.
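To illustrate the batching point, here is a back-of-the-envelope roofline-style sketch (all figures are placeholders, not vendor specs). Per decode step the weights are read once regardless of batch size, while compute scales with the batch, so beyond some batch size the step becomes compute bound:

```python
# Roofline-style sketch: memory-bound vs compute-bound decode step time.
# Assumptions (placeholders): ~2 FLOPs per parameter per token, FP16 weights,
# weight reads dominate memory traffic (ignores KV-cache traffic).
n_params = 70e9
bytes_moved = n_params * 2            # FP16 weights, read once per decode step
flops_per_token = 2 * n_params        # ~2 FLOPs per parameter per token

bandwidth = 3.35e12                   # bytes/s, placeholder
peak_flops = 990e12                   # FLOP/s, placeholder FP16 throughput

def step_time(batch):
    t_mem = bytes_moved / bandwidth                   # independent of batch size
    t_compute = flops_per_token * batch / peak_flops  # grows with batch size
    return max(t_mem, t_compute)                      # the bottleneck resource wins

# Crossover batch size where compute time catches up with memory time:
crossover = (bytes_moved / bandwidth) * peak_flops / flops_per_token
print(f"compute-bound beyond batch ~{crossover:.0f}")
```

With these placeholder numbers the crossover works out to a fairly large batch, which is why single-stream decoding is usually treated as memory bound.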

Hi Robert,

Thank you for your reply. Yes, I was referring to the memory bandwidth, and by “total aggregate bandwidth of the HGX system”, I meant the number that’s displayed in the table here: NVIDIA HGX AI Supercomputing Platform.

I assumed an HGX H100 (4-GPU) system, utilizing all 4 GPUs to run the model inference, and yes, the formula I presented is indeed a memory-bound estimate.
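In case it helps clarify what I meant, here is the calculation with the aggregate figure (the per-GPU bandwidth below is a placeholder; the actual number should come from the HGX table):

```python
# Aggregate-bandwidth variant: sum device memory bandwidth over the GPUs used.
# Assumes near-perfect tensor parallelism: weights sharded evenly across the
# 4 GPUs, with all shards read concurrently each decode step.
n_gpus = 4
per_gpu_bandwidth = 3.35e12     # bytes/s per GPU, placeholder (check the HGX table)
aggregate_bandwidth = n_gpus * per_gpu_bandwidth

weight_bytes = 70e9 * 2         # 70B params, FP16
time_per_token = weight_bytes / aggregate_bandwidth
print(f"{time_per_token * 1e3:.2f} ms/token")
```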

The blog post you shared looks very helpful. I’ll dive into it as soon as possible.

I appreciate your help.