For GEMM with large K and no sliceK, why is the occupancy low?

Hi! I am learning GEMM algorithms for large K. I found a version that runs quite fast for normal sizes, the link is here. But if we use a large K, for example a 3072 * 3072 matrix times a 3072 * 128 matrix, we see this:

Analyzer said: The difference between calculated theoretical (50.0%) and measured achieved occupancy (35.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution.

Why is the occupancy so low? Thank you!

(By the way, this algorithm runs quite fast, almost the same as cutlass for normal sizes.)

GTX 1650 has 14 Turing SMs. Each Turing SM can support 1024 threads for full occupancy. Therefore, for “full” occupancy, it would require 56 threadblocks of 256 threads each.

There is something else about the kernel (probably register usage - 128 registers/thread are indicated) that limits “theoretical” occupancy to 50%. This would translate to 28 blocks of 256 threads each.
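As a rough check of that 50% figure, here is a minimal sketch of the register-limit arithmetic. It assumes 65,536 32-bit registers per Turing SM (a figure not stated in this thread) and ignores register allocation granularity and shared memory, so treat it as an estimate rather than what the profiler actually computes. The function name `theoretical_occupancy` is made up for illustration.

```python
# Assumed hardware limits for a Turing SM (the register count is an assumption).
REGS_PER_SM = 65536         # assumed: 64K 32-bit registers per SM
MAX_THREADS_PER_SM = 1024   # from the reply: 1024 threads per SM at full occupancy

# Kernel properties quoted in the thread.
THREADS_PER_BLOCK = 256
REGS_PER_THREAD = 128


def theoretical_occupancy(regs_per_thread, threads_per_block):
    """Return (resident blocks per SM, occupancy fraction), registers vs threads only."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = REGS_PER_SM // regs_per_block
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks_per_sm = min(blocks_by_regs, blocks_by_threads)
    occupancy = blocks_per_sm * threads_per_block / MAX_THREADS_PER_SM
    return blocks_per_sm, occupancy


blocks_per_sm, occupancy = theoretical_occupancy(REGS_PER_THREAD, THREADS_PER_BLOCK)
print(blocks_per_sm, occupancy)   # 2 0.5
print(blocks_per_sm * 14)         # 28 resident blocks across 14 SMs
```

Under these assumptions the register file fits two 256-thread blocks per SM, which is 512 of 1024 threads, matching the 50% theoretical occupancy and the 28-block figure.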

According to the output here, 24 blocks of 256 threads are being launched.

24/56 ≈ 43% occupancy = maximum achievable occupancy for this launch.
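The same bound can be worked through in a few lines. This is only a sketch: the SM count, thread limit, block size and block count come from the reply, while the 128 x 128 output tile is a hypothetical value chosen because it would produce 24 blocks for a 3072 x 128 output. The kernel's real tile size is not given in this thread.

```python
NUM_SMS = 14                # GTX 1650, from the reply
MAX_THREADS_PER_SM = 1024   # Turing, from the reply
THREADS_PER_BLOCK = 256
BLOCKS_LAUNCHED = 24        # from the profiler output quoted in the reply

# Blocks needed to fill every SM with 1024 resident threads.
blocks_for_full_occupancy = NUM_SMS * MAX_THREADS_PER_SM // THREADS_PER_BLOCK

# Upper bound on occupancy when fewer blocks than that are launched.
launch_bound = BLOCKS_LAUNCHED / blocks_for_full_occupancy

print(blocks_for_full_occupancy)  # 56
print(round(launch_bound, 3))     # 0.429

# Hypothetical: without sliceK the grid depends only on the output size M x N,
# not on K. TILE_M and TILE_N are assumed values, not taken from the kernel.
M, N = 3072, 128
TILE_M, TILE_N = 128, 128
grid_blocks = (M // TILE_M) * (N // TILE_N)
print(grid_blocks)                # 24
```

If the tile-size assumption holds, it also shows why a large K does not help here: without sliceK, growing K adds work per block but no extra blocks, so the grid stays below the 28 blocks that the 50% theoretical occupancy could keep resident.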

The measured achieved occupancy (35.8%) is a bit lower, perhaps due to tail effects - “waves per sm” or other effects.
