Performance Gap Between Single Operator Profiling and LLM Model Profiling

Hello ~

I’m trying to optimize the gptq_gemm() operator in an LLM with 4-bit GPTQ quantization. gptq_gemm() is a mixed-precision GEMM implemented by two CUDA kernels (a conceptual sketch follows the list below):

  1. the first kernel dequantizes the 4-bit weights to float16 weights,
  2. the second kernel is cublasHgemm.
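
Conceptually, the two steps look roughly like the following PyTorch sketch. The int4 packing layout and the per-column scales/zeros here are simplifications for illustration, not the real kernel code:

```python
import torch

def dequant_4bit(qweight_int32, scales, zeros):
    # Toy stand-in for the first kernel: unpack eight 4-bit values from each int32
    # and apply a per-output-column scale / zero-point (simplified GPTQ-like layout).
    shifts = torch.arange(0, 32, 4, device=qweight_int32.device, dtype=torch.int32)
    w = (qweight_int32.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # (K/8, 8, N)
    w = w.reshape(-1, qweight_int32.shape[1]).to(torch.float16)       # (K, N)
    return (w - zeros) * scales

def gptq_gemm(x_fp16, qweight_int32, scales, zeros):
    w_fp16 = dequant_4bit(qweight_int32, scales, zeros)  # kernel 1: int4 -> fp16 dequantize
    return x_fp16 @ w_fp16                               # kernel 2: fp16 GEMM (cublasHgemm in the real op)
```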

When profiling with Nsight Systems 2023.3.3 on an NVIDIA GeForce RTX 4090, the reported performance of the dequantize kernel differs between profiling a single gptq_gemm() call and profiling the full LLM run.

Taking an int4 weight of shape (4096, 4096) as an example:

  1. The first case: profiling a single gptq_gemm() in isolation. I run the function a number of times, flushing the cache after each call; the pseudocode looks like this (a runnable version is sketched below, after the two cases):
for i in range(0, N):
     gptq_gemm()    # dequantize kernel + cublasHgemm
     cache.zero()   # flush the cache by writing a huge buffer

  2. The second case: profiling the whole LLM model running.

The duration of the dequantization kernel is quite different between the two cases; it is much faster during the full LLM model run.
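
For reference, a runnable version of the case 1 micro-benchmark might look like the following. The shapes, the ~256 MB flush-buffer size, and the random initialization are assumptions; gptq_gemm stands for the real op (or the sketch above):

```python
import torch

K, N = 4096, 4096
x       = torch.randn(1, K, dtype=torch.float16, device="cuda")
qweight = torch.randint(0, 2**31 - 1, (K // 8, N), dtype=torch.int32, device="cuda")
scales  = torch.rand(N, dtype=torch.float16, device="cuda")
zeros   = torch.full((N,), 8.0, dtype=torch.float16, device="cuda")
flush   = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")  # much larger than the L2 cache

for _ in range(100):
    y = gptq_gemm(x, qweight, scales, zeros)  # dequantize kernel + cublasHgemm
    flush.zero_()                             # evict L2 by writing the large buffer
torch.cuda.synchronize()
```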

  3. However, when profiling with Nsight Compute, the duration is the same in both scenarios, but it differs from the values reported by nsys.

  4. So why are there different latency values between the two nsys scenarios, and between nsys and ncu? Which one is the real latency of the dequant op?

Nsight Compute always serializes all kernels. So it makes sense that, since only one kernel is ever running at a time, they always show the same duration, whether you are looking at the broader run or not.

Nsight Systems analyzes without serialization. I can’t tell you exactly what is going on without diving deeper into the results, but I would suspect that there is some context switching going on in the single operator case. I would look at the timeline around that kernel on the GPU and see if there are other things going on.

If you do, you might want to refer to https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/ to help you understand what is going on.
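
If you want a third data point outside of either tool, you can also bracket the op with CUDA events in the application itself. A minimal sketch, assuming gptq_gemm is callable from Python and x, qweight, scales, zeros are set up as in the micro-benchmark above:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end   = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
y = gptq_gemm(x, qweight, scales, zeros)   # dequantize + GEMM on the current stream
end.record()
torch.cuda.synchronize()
print(f"gptq_gemm GPU time: {start.elapsed_time(end):.3f} ms")  # covers both kernels
```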
