I used the cuBLAS library to launch a CUDA kernel for matrix multiplication, and I’ve been using Nsight Compute to analyze its performance. However, there seems to be a discrepancy between the launch__thread_count metric reported by Nsight Compute and what I believe are the actual thread block size and total number of threads, across all blocks, that the GPU executes during the kernel launch.
Here are the details of what I ran and the observations I made:
- Kernel name:
- Input matrix size: m=n=k=128
- Grid size: 4x4x1 (total of 16 thread blocks)
- Block size: 128x1x1 (total of 128 threads per thread block)
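For reference, the totals implied by this launch configuration can be worked out as in the sketch below. It is plain Python arithmetic only; the variable names are my own and are not Nsight Compute or CUDA identifiers.

```python
from math import prod

# Launch configuration copied from the list above.
grid_dim = (4, 4, 1)      # thread blocks in x, y, z
block_dim = (128, 1, 1)   # threads per block in x, y, z

num_blocks = prod(grid_dim)                      # 4 * 4 * 1
threads_per_block = prod(block_dim)              # 128 * 1 * 1
total_threads = num_blocks * threads_per_block   # blocks * threads per block

print(num_blocks, threads_per_block, total_threads)  # prints "16 128 2048"
```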
When I analyzed the kernel with the Nsight Compute profiler, the reported launch__block_size was 128x1x1, which suggests that there are only 128 threads per block. Additionally, the reported launch__thread_count was 2048 (16 blocks x 128 threads), which doesn’t match the total number of threads I expected.
Based on my understanding, the actual thread block size should be 32x32x1, because the kernel’s tile size is 32x32 as indicated in its name. Each thread block would then contain 1024 threads. The total number of threads for the kernel launch should equal the number of blocks in the grid times the number of threads per block, which in this case would be 16 x 1024 = 16384.
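To make the two numbers easier to compare, here is a small Python sketch of the arithmetic. The last step is purely hypothetical: it assumes the 128 threads of a block together cover one whole 32x32 tile, which I have not confirmed for this kernel. All names are my own.

```python
# Values taken from the post; nothing here queries the GPU or the profiler.
m = n = 128                       # output matrix is m x n
grid_blocks = 4 * 4 * 1           # 16 thread blocks
tile_elems = 32 * 32              # tile size read from the kernel name
reported_threads_per_block = 128  # from launch__block_size

# My expectation: one thread per tile element.
expected_total = grid_blocks * tile_elems                  # 16 * 1024
# What the profiler reports: 128 threads per block.
reported_total = grid_blocks * reported_threads_per_block  # 16 * 128

# The 16 tiles of 32x32 do cover the 128x128 output exactly.
tiles_cover_output = grid_blocks * tile_elems == m * n

# Hypothetical assumption: if 128 threads share one 32x32 tile,
# each thread would be responsible for this many output elements.
elems_per_thread = tile_elems // reported_threads_per_block

print(expected_total, reported_total, tiles_cover_output, elems_per_thread)
```

If that assumption holds, the tile size in the kernel name would describe the output region a block computes rather than its thread layout, but that is a guess on my part.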
Can anyone explain why the reported launch__thread_count differs from the thread block size and total number of threads that I expect the GPU to execute during the kernel launch? Is there a better way to determine the actual total number of threads for a kernel launch?
Any help would be greatly appreciated. Thanks!