Why are the number of thread blocks per cluster and the amount of dynamic shared memory I can allocate much lower than expected?

Link: The __cluster_dims__ annotation doesn’t work as expected with CuPy. · Issue #8778 · cupy/cupy

I originally thought this issue was specific to CuPy, but the same behavior shows up with code compiled natively with NVCC. It turns out that when the kernel is launched cooperatively, the thread block cluster size cannot go above 8 even on an H100 card, and the dynamic shared memory cannot go above 113 KB.

This isn’t what I expected. I thought I’d be able to access 3.5 MB of distributed shared memory in a thread block cluster instead of just 984 KB.

Why is this happening? Did I misunderstand that it should be possible to use 16 blocks in a cluster in all cases, or am I configuring the kernel incorrectly?

You have to explicitly allow non-portable cluster sizes. See Section 1.4.1.3 here:

“Launching a kernel with a nonportable cluster size requires setting the cudaFuncAttributeNonPortableClusterSizeAllowed function attribute.”
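For what it’s worth, here is a minimal sketch of that opt-in, plus a query of what cluster size the runtime will actually accept. This is not from the issue itself; `kernel` is a placeholder for a __global__ function, and the grid/block dimensions are just examples. The query function cudaOccupancyMaxActiveClusters-style API used here, cudaOccupancyMaxPotentialClusterSize, is the runtime occupancy call for clusters (CUDA 11.8+).

// Sketch only: enable non-portable cluster sizes for a kernel, then ask the
// runtime for the largest cluster size it would accept for this configuration.
// "kernel" is a placeholder; error checks omitted for brevity.
cudaFuncSetAttribute(kernel, cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

cudaLaunchConfig_t config = {};
config.gridDim = dim3(128);        // example grid
config.blockDim = dim3(256);       // example block
config.dynamicSmemBytes = 0;

int maxClusterSize = 0;
cudaOccupancyMaxPotentialClusterSize(&maxClusterSize, (const void*)kernel, &config);
printf("max potential cluster size: %d\n", maxClusterSize);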

Good catch. But what about the shared memory limit being so much lower than expected when using 8 blocks per cluster?

Cooperative launch with the maximum amount of shared memory and cluster size 8 works for me if 120 blocks are used instead of 128.

Note that when using NCU to profile an ordinary launch of the kernel (on GraceHopper) with a launch configuration of <<<128, 256, 200*1024>>>, the profiler reports 1.07 waves per SM. The number of active clusters is reported as 15, not 16, which means not all blocks can be scheduled simultaneously, and thus a cooperative launch is impossible.
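The corresponding number can also be queried programmatically. A hedged sketch using cudaOccupancyMaxActiveClusters follows; `myKernel` is a placeholder for a kernel without a hard-coded __cluster_dims__, and the grid, block size, cluster size, and dynamic shared memory mirror the profiled launch:

// Sketch only: query how many clusters of this configuration can be resident on
// the device at once. If the result is smaller than gridDim / clusterSize, a
// cooperative launch of that grid cannot succeed. Error checks omitted.
cudaLaunchConfig_t config = {};
config.gridDim = dim3(128);
config.blockDim = dim3(256);
config.dynamicSmemBytes = 200 * 1024;

cudaLaunchAttribute attrs[1];
attrs[0].id = cudaLaunchAttributeClusterDimension;
attrs[0].val.clusterDim.x = 8;    // cluster size of the profiled launch
attrs[0].val.clusterDim.y = 1;
attrs[0].val.clusterDim.z = 1;
config.attrs = attrs;
config.numAttrs = 1;

int activeClusters = 0;
cudaOccupancyMaxActiveClusters(&activeClusters, (const void*)myKernel, &config);
printf("max active clusters: %d\n", activeClusters);  // expectation: 15, matching the NCU report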

#include <stdio.h>
#include <cuda_runtime.h>
#include <iostream>

// Error-checking helper: print the CUDA error string with file/line and abort.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t error, const char *file, int line, bool abort=true) {
    if (error != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(error), file, line);
        if (abort) exit(error);
    }
}

// Kernel with a compile-time cluster size of 16 thread blocks.
__global__ void __cluster_dims__(16) hello(int a, int b) {
    if (threadIdx.x == 0) {
        printf("Hello, CUDA from %i! %i + %i = %i\n", blockIdx.x, a, b, a + b);
    }
}

int main() {
    // Opt in to 214 KiB of dynamic shared memory per block.
    size_t maxDynamicSharedMemory = 214 * (1 << 10);
    gpuErrchk(cudaFuncSetAttribute(hello, cudaFuncAttributeMaxDynamicSharedMemorySize, maxDynamicSharedMemory));
    // Allow a non-portable cluster size (the attribute is a boolean flag, so 1 enables it).
    gpuErrchk(cudaFuncSetAttribute(hello, cudaFuncAttributeNonPortableClusterSizeAllowed, 1));

    int i = 0;
    int dev = 0;
    cudaDeviceProp deviceProp;
    gpuErrchk(cudaGetDeviceProperties(&deviceProp, dev));
    void * args[] = {reinterpret_cast<void*>(&i), reinterpret_cast<void*>(&i)};
    // Cooperative launch: 112 blocks of 256 threads, each requesting the full dynamic shared memory.
    gpuErrchk(cudaLaunchCooperativeKernel(hello, 112, 256, args, maxDynamicSharedMemory));
    gpuErrchk(cudaDeviceSynchronize());
    std::cout << "Done." << std::endl;
    return 0;
}

Surprisingly, the above works. Great! But why can’t I launch 128 blocks, and instead have to drop down to 112? The card I am using should have 132 SMs.

I could not figure it out.

The number of blocks must be a multiple of the cluster size, which makes using 132 blocks with cluster size 8 or 16 impossible.
According to this simple calculation, cluster size 4 should work with 132 blocks, and cluster sizes 8 and 16 should work with 128 blocks. But they don’t.
However, cluster sizes 1 and 2 work with 132 blocks.
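One way to pin the limit down empirically is to probe. A small sketch follows (reusing hello, gpuErrchk, and maxDynamicSharedMemory from the program above, run after the two cudaFuncSetAttribute calls, with the compile-time cluster size of 16) that walks the grid down in multiples of the cluster size until a cooperative launch is accepted:

// Sketch only: find the largest grid, in multiples of the compile-time cluster
// size 16, that a cooperative launch with the full dynamic shared memory accepts.
int i = 0;
void* args[] = {reinterpret_cast<void*>(&i), reinterpret_cast<void*>(&i)};
for (int blocks = 128; blocks >= 16; blocks -= 16) {
    cudaError_t err = cudaLaunchCooperativeKernel(hello, blocks, 256, args, maxDynamicSharedMemory);
    printf("%3d blocks: %s\n", blocks, cudaGetErrorString(err));
    cudaGetLastError();                       // reset the error state before the next attempt
    if (err == cudaSuccess) {                 // first success is the largest grid that fits
        gpuErrchk(cudaDeviceSynchronize());
        break;
    }
}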

It probably depends on how the TPCs were reduced compared to the full GH100 GPU chip.
According to this blog post: NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

The full implementation of the GH100 GPU includes the following units:
8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU

and

The NVIDIA H100 GPU with SXM5 board form-factor includes the following units:
8 GPCs, 66 TPCs, 2 SMs/TPC, 132 SMs per GPU

For example, if there were 7 GPCs with 9 TPCs each and 1 GPC with 3 TPCs (7 × 18 + 6 = 132 SMs), the last GPC could handle neither a cluster of size 8 nor one of size 16.
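To make the arithmetic explicit, here is a purely illustrative sketch for that hypothetical layout (the real GPC partitioning of this particular card is not known from the thread). It assumes each block occupies a full SM, as it does with ~200 KB of dynamic shared memory per block, and that a cluster cannot span GPCs:

// Illustrative only: hypothetical 7 GPCs x 18 SMs + 1 GPC x 6 SMs = 132 SMs.
// With one block per SM, count how many whole clusters fit per GPC.
int smsPerGpc[8] = {18, 18, 18, 18, 18, 18, 18, 6};
int clusterSize = 16;
int clusters = 0;
for (int g = 0; g < 8; ++g)
    clusters += smsPerGpc[g] / clusterSize;
printf("%d resident clusters of %d -> %d blocks\n", clusters, clusterSize, clusters * clusterSize);
// Prints 7 clusters -> 112 blocks, matching the cooperative launch that works above.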


I was going to open a support ticket to ask about this, but your answer strikes me as very likely correct. It’s weird that they’d put in 8 GPCs but then have 18 SMs in each of them instead of 16. It doesn’t match the CUDA programming model at all.

The high-level division of the GPU is not so much about the CUDA programming model as about manufacturability. Many dies have defective parts (e.g. an SM), and the whole die can still be used.

