I originally thought this issue was related to CuPy, but the same behavior shows up with code compiled natively with NVCC. It turns out that when the kernel is launched in cooperative groups mode, the thread block cluster dimension cannot go above 8 even on an H100 card, and the maximum dynamic shared memory cannot go above 113 KB.
This isn’t what I expected. I thought I’d be able to access 3.5 MB of distributed shared memory in a thread block cluster instead of just 984 KB.
Why is this happening? Did I misunderstand that it should be possible to use 16 blocks in a cluster in all cases, or am I configuring the kernel incorrectly?
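For reference, a cooperative cluster launch of the kind described above is set up roughly like this with the extended launch API (the kernel body, buffer names, sizes and error handling are illustrative placeholders, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: body and names are illustrative only.
__global__ void kernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}

int main() {
    const size_t dynSmem = 200 * 1024;   // dynamic shared memory per block

    // Opt in to more than 48 KB of dynamic shared memory per block.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)dynSmem);
    // Required when requesting a cluster size above the portable limit of 8.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(128);
    cfg.blockDim = dim3(256);
    cfg.dynamicSmemBytes = dynSmem;

    cudaLaunchAttribute attrs[2] = {};
    attrs[0].id = cudaLaunchAttributeClusterDimension;
    attrs[0].val.clusterDim.x = 8;       // thread block cluster size under test
    attrs[0].val.clusterDim.y = 1;
    attrs[0].val.clusterDim.z = 1;
    attrs[1].id = cudaLaunchAttributeCooperative;
    attrs[1].val.cooperative = 1;        // cooperative (grid-wide sync) launch
    cfg.attrs = attrs;
    cfg.numAttrs = 2;

    float *out;
    cudaMalloc(&out, 128 * sizeof(float));
    cudaError_t err = cudaLaunchKernelEx(&cfg, kernel, out);
    printf("launch: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Setting cudaFuncAttributeNonPortableClusterSizeAllowed is what permits asking for cluster sizes above the portable limit of 8 in the first place; the question remains whether a launch like this can actually succeed with roughly 200 KB of dynamic shared memory per block.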
Cooperative launch with the maximum amount of shared memory and cluster size 8 works for me if 120 blocks are used instead of 128.
Note that when using NCU to profile an ordinary launch of the kernel (on Grace Hopper) with a grid configuration of <<<128, 256, 200*1024>>>, the profiler reports 1.07 waves per SM. The number of active clusters is reported as 15, not 16, which means not all blocks can be scheduled simultaneously, and thus cooperative launch is impossible.
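The same number can be obtained without the profiler. A sketch along these lines (placeholder kernel, same 200 KB per-block configuration) asks the occupancy API how many clusters of size 8 can be resident at once, and should report the same 15:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {              // placeholder kernel, never launched here
    extern __shared__ float smem[];
    smem[threadIdx.x] = 0.0f;
}

int main() {
    const size_t dynSmem = 200 * 1024;  // same 200 KB per block as discussed above
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)dynSmem);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(128);
    cfg.blockDim = dim3(256);
    cfg.dynamicSmemBytes = dynSmem;

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 8;          // cluster size being tested
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    int clusters = 0;
    // Maximum number of clusters of this shape that can be resident at once.
    cudaOccupancyMaxActiveClusters(&clusters, (void*)kernel, &cfg);
    printf("max active clusters: %d -> max cooperative grid: %d blocks\n",
           clusters, clusters * 8);
    return 0;
}
```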
The number of blocks must be a multiple of the cluster size, which makes using 132 blocks with cluster size 8 or 16 impossible.
According to this simple calculation, cluster size 4 should work with 132 blocks, and cluster size 8 and 16 should work with 128 blocks. But they don’t.
However, cluster size 1 and 2 work with 132 blocks.
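To make the "simple calculation" explicit, here is a host-side sketch of that arithmetic. It assumes one block per SM (since ~200 KB of dynamic shared memory leaves room for only one block) and only enforces that the grid size is a multiple of the cluster size; the 132 is the SM count of the H100 SXM / GH200 part. The cluster placement constraints this ignores are presumably what make the larger sizes fail in practice.

```cpp
#include <cstdio>

int main() {
    // Naive estimate only: one block per SM, grid a multiple of the cluster size,
    // no GPC-level placement constraints considered.
    const int numSMs = 132;                     // H100 SXM / GH200
    const int clusterSizes[] = {1, 2, 4, 8, 16};
    for (int c : clusterSizes) {
        int clusters = numSMs / c;              // clusters that "should" fit
        int blocks   = clusters * c;            // largest grid that is a multiple of c
        printf("cluster size %2d -> %3d clusters -> %3d blocks\n", c, clusters, blocks);
    }
    return 0;
}
```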
I was going to open a support ticket to ask about this, but your answer strikes me as highly likely. It’s strange that they’d put in 8 GPCs but then have 18 SMs in each of them instead of 16; it doesn’t match the CUDA programming model at all.
The high-level division of the GPU is not so much about the CUDA programming model as about manufacturability. Many dies have defective parts (e.g. an SM), and the whole die can still be used.