Query regarding launch__block_size and launch__thread_count reported by Nsight Compute for a CUDA kernel

Hi all,

I used the cuBLAS library to launch a CUDA kernel for matrix multiplication, and I've been using Nsight Compute to analyze its performance. I've noticed what appears to be a discrepancy between the launch__block_size and launch__thread_count metrics reported by Nsight Compute and the thread block size and total thread count (across all blocks) that I expected the GPU to execute for this kernel launch.

Here are the details of what I ran and the observations I made:

  • Kernel name: volta_sgemm_32x32_sliced1x4nn
  • Input matrix size: m=n=k=128
  • Grid size: 4x4x1 (total of 16 thread blocks)
  • Block size: 128x1x1 (total of 128 threads per thread block)
  • Reported launch__block_size: 128x1x1
  • Reported launch__thread_count: 2048

However, the reported launch__block_size is 128x1x1, which suggests there are only 128 threads per block, and the reported launch__thread_count is 2048, which doesn't match the total number of threads I expected.

Based on my understanding, the thread block size should be 32x32x1, because the kernel's name indicates a 32x32 tile size. Each thread block would then contain 1024 threads, and the total number of threads for the kernel launch should equal the number of blocks times the threads per block, which in this case is 16 x 1024 = 16384.

Can anyone explain why the reported launch__block_size and launch__thread_count metrics differ from the thread block size and total thread count I expected the GPU to execute for this kernel launch? Is there a better way to determine the actual total number of threads for a kernel launch?

Any help would be greatly appreciated. Thanks!

As I understand it, the sgemm_32x32 kernel is optimized for matrices that can be divided into 32x32 tiles, but that does not imply the thread block size is determined by the tile size. The thread block size is chosen by the library based on what it thinks will give the best performance. I could be wrong, and if you have documentation to the contrary, please pass it along. In this case, it used a 128x1x1 block size (128 threads per block) with a 4x4x1 grid of those blocks, which resulted in 2048 threads working on the problem. Nsight Compute is reporting what really happened based on how the library decomposed the problem.
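To make the arithmetic explicit, here is a quick host-side sanity check (plain CUDA host code, using the dim3 values from your profile) showing that the reported launch__thread_count follows directly from grid size times block size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Launch configuration reported by Nsight Compute for
    // volta_sgemm_32x32_sliced1x4nn at m = n = k = 128.
    dim3 grid(4, 4, 1);    // 16 thread blocks
    dim3 block(128, 1, 1); // 128 threads per block

    unsigned blocks  = grid.x * grid.y * grid.z;
    unsigned threads = blocks * block.x * block.y * block.z;

    // 16 blocks * 128 threads/block = 2048, matching launch__thread_count.
    printf("blocks = %u, total threads = %u\n", blocks, threads);
    return 0;
}
```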

@jmarusarz In that case, how do you explain the calculation of all the outputs? How are 2048 threads calculating 16384 outputs (and why)?

I don't know for sure what's happening behind the scenes, but there's nothing stopping each thread from calculating more than one output in a loop. I don't think there is any requirement that each thread compute only one output. The library may break the matrix up into small chunks that each thread is responsible for computing. As for why, that's a question for the sgemm team, and I don't have an answer for that.
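To make that concrete, here is a toy sketch (my own illustration, not cuBLAS's actual algorithm) launched with the same 4x4x1 grid of 128x1x1 blocks, where each of the 2048 threads computes 16384 / 2048 = 8 of the outputs in a loop:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: a naive SGEMM-like kernel (C = A * B for square
// n x n matrices) where each thread computes several output elements.
// This is NOT cuBLAS's tiling scheme, just a demonstration that the
// thread count doesn't have to match the output count.
__global__ void sgemm_multi_output(const float* A, const float* B,
                                   float* C, int n) {
    int nthreads = gridDim.x * gridDim.y * blockDim.x; // 2048 here
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x
              + threadIdx.x;

    // Stride over all n*n outputs: with n = 128 and 2048 threads,
    // each thread produces 16384 / 2048 = 8 elements.
    for (int idx = tid; idx < n * n; idx += nthreads) {
        int row = idx / n, col = idx % n;
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[idx] = acc;
    }
}

int main() {
    const int n = 128;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    // Same launch shape the profile showed: 4x4x1 grid, 128x1x1 blocks.
    sgemm_multi_output<<<dim3(4, 4, 1), dim3(128, 1, 1)>>>(A, B, C, n);
    cudaDeviceSynchronize();

    // With all-ones inputs, every dot product is n = 128.
    printf("C[0] = %.0f (expected %d)\n", C[0], n);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

If you profile this kernel, Nsight Compute should still report launch__thread_count = 2048, even though the kernel writes all 16384 outputs. Something analogous (though far more optimized) is presumably what the cuBLAS kernel is doing.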