I originally thought this issue was related to CuPy, but the same behavior shows up with code compiled natively with NVCC. It turns out that when the kernel is launched in cooperative groups mode, the thread block cluster dimension cannot go above 8 even on an H100 card, and the maximum dynamic shared memory cannot go above 113 KB.
This isn’t what I expected. I thought I’d be able to access 3.5 MB of distributed shared memory in a thread block cluster instead of just 984 KB.
Why is this happening? Did I misunderstand that it should be possible to use 16 blocks in a cluster in all cases, or am I configuring the kernel incorrectly?
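For reference, a cooperative cluster launch of the kind described above is set up roughly like this with the extended launch API (the kernel body, buffer names, sizes and error handling are illustrative placeholders, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: body and names are illustrative only.
__global__ void kernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}

int main() {
    const size_t dynSmem = 200 * 1024;   // dynamic shared memory per block

    // Opt in to more than 48 KB of dynamic shared memory per block.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)dynSmem);
    // Required when requesting a cluster size above the portable limit of 8.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(128);
    cfg.blockDim = dim3(256);
    cfg.dynamicSmemBytes = dynSmem;

    cudaLaunchAttribute attrs[2] = {};
    attrs[0].id = cudaLaunchAttributeClusterDimension;
    attrs[0].val.clusterDim.x = 8;       // thread block cluster size under test
    attrs[0].val.clusterDim.y = 1;
    attrs[0].val.clusterDim.z = 1;
    attrs[1].id = cudaLaunchAttributeCooperative;
    attrs[1].val.cooperative = 1;        // cooperative (grid-wide sync) launch
    cfg.attrs = attrs;
    cfg.numAttrs = 2;

    float *out;
    cudaMalloc(&out, 128 * sizeof(float));
    cudaError_t err = cudaLaunchKernelEx(&cfg, kernel, out);
    printf("launch: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Setting cudaFuncAttributeNonPortableClusterSizeAllowed is what permits asking for cluster sizes above the portable limit of 8 in the first place; the question remains whether a launch like this can actually succeed with roughly 200 KB of dynamic shared memory per block.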
Cooperative launch with the maximum amount of shared memory and cluster size 8 works for me if 120 blocks are used instead of 128.
Note that when using NCU to profile an ordinary launch of the kernel (on Grace Hopper) with a grid configuration of <<<128, 256, 200*1024>>>, the profiler reports 1.07 waves per SM. The number of active clusters is reported as 15, not 16, which means not all blocks can be scheduled simultaneously, and thus cooperative launch is impossible.
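The same number can be obtained without the profiler. A sketch along these lines (placeholder kernel, same 200 KB per-block configuration) asks the occupancy API how many clusters of size 8 can be resident at once, and should report the same 15:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {              // placeholder kernel, never launched here
    extern __shared__ float smem[];
    smem[threadIdx.x] = 0.0f;
}

int main() {
    const size_t dynSmem = 200 * 1024;  // same 200 KB per block as discussed above
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)dynSmem);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(128);
    cfg.blockDim = dim3(256);
    cfg.dynamicSmemBytes = dynSmem;

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 8;          // cluster size being tested
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    int clusters = 0;
    // Maximum number of clusters of this shape that can be resident at once.
    cudaOccupancyMaxActiveClusters(&clusters, (void*)kernel, &cfg);
    printf("max active clusters: %d -> max cooperative grid: %d blocks\n",
           clusters, clusters * 8);
    return 0;
}
```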
The number of blocks must be a multiple of the cluster size, which makes using 132 blocks with cluster size 8 or 16 impossible.
According to this simple calculation, cluster size 4 should work with 132 blocks, and cluster size 8 and 16 should work with 128 blocks. But they don’t.
However, cluster size 1 and 2 work with 132 blocks.
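To make the "simple calculation" explicit, here is a host-side sketch of that arithmetic. It assumes one block per SM (since ~200 KB of dynamic shared memory leaves room for only one block) and only enforces that the grid size is a multiple of the cluster size; the 132 is the SM count of the H100 SXM / GH200 part. The cluster placement constraints this ignores are presumably what make the larger sizes fail in practice.

```cpp
#include <cstdio>

int main() {
    // Naive estimate only: one block per SM, grid a multiple of the cluster size,
    // no GPC-level placement constraints considered.
    const int numSMs = 132;                     // H100 SXM / GH200
    const int clusterSizes[] = {1, 2, 4, 8, 16};
    for (int c : clusterSizes) {
        int clusters = numSMs / c;              // clusters that "should" fit
        int blocks   = clusters * c;            // largest grid that is a multiple of c
        printf("cluster size %2d -> %3d clusters -> %3d blocks\n", c, clusters, blocks);
    }
    return 0;
}
```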
I was going to open a support ticket to ask about this, but your answer strikes me as highly likely. It’s strange that they’d put in 8 GPCs but then have 18 SMs in each of them instead of 16; it doesn’t match the CUDA programming model at all.
The high-level division of the GPU is not so much about the CUDA programming model as about manufacturability. Many dies have defective parts (e.g. an SM), and the whole die can still be used.