Hello ~
I’m trying to optimize the gptq_gemm() operator in an LLM with 4-bit GPTQ quantization. gptq_gemm() is really a mixed-precision GEMM, implemented as two CUDA kernels (sketched below):
- the first kernel dequantizes the 4-bit weights to float16,
- the second kernel is cublasHgemm.
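In Python terms the operator is roughly the following. This is only a minimal sketch: the uint8 packing and the per-column scale/zero handling are simplified stand-ins for the real GPTQ layout, and in the real op the dequantization is a hand-written CUDA kernel rather than PyTorch ops.

```python
import torch

def dequant_int4_to_fp16(qweight, scales, zeros):
    """Illustrative dequantization: two int4 values packed per uint8,
    per-column scale/zero-point. Not the exact GPTQ layout."""
    lo = (qweight & 0x0F).to(torch.float16)
    hi = (qweight >> 4).to(torch.float16)
    w = torch.stack((lo, hi), dim=1).reshape(-1, qweight.shape[1])  # (K, N)
    return (w - zeros) * scales

def gptq_gemm(x, qweight, scales, zeros):
    # Kernel 1: dequantize the packed int4 weights to fp16.
    w_fp16 = dequant_int4_to_fp16(qweight, scales, zeros)
    # Kernel 2: fp16 GEMM; on half-precision CUDA tensors this goes
    # through a cuBLAS half-precision GEMM path (cublasHgemm).
    return x @ w_fp16
```

For the (4096, 4096) example below, the dequantize kernel reads about 8 MiB of packed int4 data and writes a 32 MiB fp16 matrix.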
When profiling with Nsight Systems 2023.3.3 on an NVIDIA GeForce RTX 4090, the duration of the dequantize kernel differs between profiling a single gptq_gemm() call and profiling the whole LLM run.
Taking an int4 weight of shape (4096, 4096) as an example:
- The first case: profiling gptq_gemm() in isolation. I call the function N times, flushing the cache after each call; the pseudocode looks like this (a fuller sketch is at the end of this post):
    for i in range(N):
        gptq_gemm()
        cache.zero()  # flush the cache by writing a huge buffer
- The second case: profiling the whole LLM model running.
The duration of the dequantize kernel is quite different between the two cases: it is much faster during the full LLM run.
- However, when profiling with Nsight Compute, the duration is the same in both scenarios, but it differs from the values reported by nsys.
- So why do the latency values differ between the two nsys scenarios, and between nsys and ncu?
Which one is the real latency of the dequantize op?
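For reference, here is a more complete version of the standalone benchmark loop from the first case. It is a sketch assuming PyTorch and the gptq_gemm wrapper sketched above; the 256 MiB flush-buffer size is an illustrative choice, picked to exceed the RTX 4090's 72 MB L2 cache.

```python
import torch

N_ITERS = 100
K = N = 4096

# Buffer larger than the RTX 4090's 72 MB L2, so zeroing it evicts the
# quantized weights from L2 between iterations (cold-cache timing).
flush_buf = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Dummy inputs matching the (4096, 4096) example (layout simplified as above).
x = torch.randn(1, K, dtype=torch.float16, device="cuda")
qweight = torch.randint(0, 255, (K // 2, N), dtype=torch.uint8, device="cuda")
scales = torch.full((N,), 0.01, dtype=torch.float16, device="cuda")
zeros = torch.full((N,), 8.0, dtype=torch.float16, device="cuda")

torch.cuda.synchronize()
for _ in range(N_ITERS):
    y = gptq_gemm(x, qweight, scales, zeros)
    flush_buf.zero_()  # cache flush: write a huge buffer after each call
torch.cuda.synchronize()
```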