Does only H100 support the kernel launch configuration attribute for clustering?

Suppose I need to experiment with cluster dimensions when launching a kernel: I want to map certain SM units to specific regions of data and accelerate memory operations such as global atomic increments on the same items within a group of blocks (and likewise within the other groups), expecting the CUDA hardware to somehow reduce/accelerate atomic requests on the same element between the SM units of one cluster. Does only the H100 GPU support this?

On a Colab A100 GPU: “a kernel launch error has occurred due to cluster misconfiguration” is returned when using a cluster config.

On a local computer with a 4070 GPU: same error, even with toolkit version 12.8.

On a 5070 GPU: “A breakpoint instruction (__debugbreak() statement or a similar call) was executed” is returned when trying a cluster config. This looks like support for the 5070 is not yet complete (and the cluster attribute in the kernel launch configuration possibly won’t work).


Another question about clusters: does an atomic add operation really get accelerated if all the SM units making such accesses are in the same GPC or the same cluster? Is there similar acceleration when launching a cooperative kernel with shared memory shared between blocks? As for texture memory, does a GPC or a cluster also cache textures at another level of caching? Or is clustering mainly for writing data and making it visible to other blocks?
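To illustrate, the kind of pattern I’m hoping gets accelerated is roughly this sketch (a hypothetical kernel; this_cluster and map_shared_rank are the cooperative groups cluster API, which requires CC 9.0):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Hypothetical kernel: each block in the cluster atomically increments a
    // counter that lives in block rank 0's shared memory (distributed shared
    // memory), so only one global atomic per cluster reaches DRAM.
    __global__ void k_clusterCounterSketch(int* globalCounter)
    {
        __shared__ int localCounter;
        cg::cluster_group cluster = cg::this_cluster();

        if (threadIdx.x == 0)
            localCounter = 0;
        cluster.sync();  // shared memory of all blocks is initialized and mapped

        // Aggregate increments inside the cluster first...
        int* remote = cluster.map_shared_rank(&localCounter, 0);
        atomicAdd(remote, 1);

        cluster.sync();  // make remote increments visible to rank 0
        // ...then flush a single atomic per cluster to global memory.
        if (cluster.block_rank() == 0 && threadIdx.x == 0)
            atomicAdd(globalCounter, localCounter);
    }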

Both are probably to be expected, as Thread Block Cluster support was only added in CC 9.0 and above.

When querying CC from C++,


    int device;
    cudaGetDevice(&device);  // Get the currently active device

    cudaDeviceProp prop2;
    cudaGetDeviceProperties(&prop2, device);  // Get properties of the device

    printf("  %s ", prop2.name);
    printf(" CC %i %i", prop2.major, prop2.minor);

output is

NVIDIA GeForce RTX 5070  CC 12 0

So are some CC 9.0 features exclusive to 9.0, such that even CC 12.0 won’t support them? Or is it about this being a desktop GPU, unlike the H100?

I’m using compiler flags like these:
compute_89,sm_89;compute_90,sm_90.
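For the 5070, I presume the Blackwell target would also need to be in that list; my guess at the equivalent nvcc command line (compute_120/sm_120 require CUDA 12.8, and kernel.cu is a placeholder):

    nvcc -gencode arch=compute_89,code=sm_89 \
         -gencode arch=compute_90,code=sm_90 \
         -gencode arch=compute_120,code=sm_120 kernel.cu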

At least, is there a way to emulate/simulate clusters so that I can gain some experience with them on a Colab A100 or a desktop 5070? Even if it runs slower under emulation I don’t care; I just want some experience without purchasing an H100.

Looking in the PTX manual, 9.0 and above seems to apply to the cluster-related instructions, with no mention of exceptions.

With the 5070, just the following code:

        cudaGetDeviceProperties(&prop2, 1);  // Get properties of device index 1

        printf("  %i ", prop2.clusterLaunch);

This returns

1

So yes, it looks like it is supported, but when trying a simple config like this:

    gpuAttributes[1].id = cudaLaunchAttributeClusterDimension;
    gpuAttributes[1].val.clusterDim.x = 1;
    gpuAttributes[1].val.clusterDim.y = 1;
    gpuAttributes[1].val.clusterDim.z = 1;
    gpuAttributes[1].val.clusterSchedulingPolicyPreference = cudaClusterSchedulingPolicyLoadBalancing;
    gpuAttributes[1].val.preferredClusterDim.x = 1;
    gpuAttributes[1].val.preferredClusterDim.y = 1;
    gpuAttributes[1].val.preferredClusterDim.z = 1;

it returns the error:

a kernel launch error has occurred due to cluster misconfiguration

Is there a lower limit on the cluster x, y, z dimensions for a GPU?
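For the upper end there is at least a runtime query; a minimal sketch (k_something is my kernel, and the grid/block sizes are example values):

    // Query the largest cluster size the kernel can portably launch with.
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(8, 1, 1);
    cfg.blockDim = dim3(256, 1, 1);
    int maxClusterSize = 0;
    cudaOccupancyMaxPotentialClusterSize(&maxClusterSize, (void*)k_something, &cfg);
    printf("max potential cluster size: %i\n", maxClusterSize);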

Edit: I’m sorry, I just found out that I was missing extra id values like cudaLaunchAttributeClusterSchedulingPolicyPreference.

With this:

    gpuAttributes[0].id = cudaLaunchAttributeClusterDimension;
    gpuAttributes[0].val.clusterDim.x = 1;
    gpuAttributes[0].val.clusterDim.y = 1;
    gpuAttributes[0].val.clusterDim.z = 1;
    gpuAttributes[1].id = cudaLaunchAttributeClusterSchedulingPolicyPreference;
    gpuAttributes[1].val.clusterSchedulingPolicyPreference = cudaClusterSchedulingPolicyDefault;
    gpuAttributes[2].id = cudaLaunchAttributePreferredClusterDimension;
    gpuAttributes[2].val.preferredClusterDim.x = 1;
    gpuAttributes[2].val.preferredClusterDim.y = 1;
    gpuAttributes[2].val.preferredClusterDim.z = 1;
    config.numAttrs = 3;
    config.attrs = &gpuAttributes[0];

The error is “invalid configuration argument”.

Thread block clusters are supported on CC 9.0 and higher. From here:

Compute capability 9.0 and above allows users to specify compile time thread block cluster dimensions, so that the kernel can use the cluster hierarchy in CUDA.

Looks like 1x1x1 may not be valid for a cluster grid.

Is it because a GPC can only contain a limited number of SM units?

compile time thread block cluster dimensions

So my error was defining a non-const variable and setting it before launching the kernel. Then, if I have to support multiple configurations, do I have to prepare for all possible scenarios and use constexpr values?

gpuAttributes[0].val is being initialized at run time. But when I move the code into a constexpr function, it says a constexpr function cannot have the non-literal type cudaLaunchConfig_t as a return value. I guess it uses some pointers inside, and pointers are not known at compile time.

Is the following launch type not good enough?

cudaLaunchKernelExC(&config, (void*)k_something, args);
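For context, the surrounding pattern is roughly this sketch (everything is filled in at run time; k_something is a placeholder for my kernel, and the sizes and argument are example values):

    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(8, 1, 1);    // must be divisible by the cluster dims
    config.blockDim = dim3(256, 1, 1);

    cudaLaunchAttribute gpuAttributes[1];
    gpuAttributes[0].id = cudaLaunchAttributeClusterDimension;
    gpuAttributes[0].val.clusterDim.x = 2;
    gpuAttributes[0].val.clusterDim.y = 1;
    gpuAttributes[0].val.clusterDim.z = 1;
    config.attrs = gpuAttributes;
    config.numAttrs = 1;

    int* d_data = nullptr;             // example kernel argument
    void* args[] = { &d_data };
    cudaLaunchKernelExC(&config, (void*)k_something, args);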

Edit: I’m not using __cluster_dims__(2, 1, 1). Is it mandatory to add this to a kernel definition?

There are two methods to specify cluster dimensions. One is compile time and one is runtime. Both methods are discussed here with examples of each.
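For reference, the compile-time method is just an attribute on the kernel definition itself; a minimal sketch (the kernel name and parameter are made up):

    // Compile-time method: the cluster shape is baked into the kernel.
    // Every launch of this kernel uses 2x1x1 clusters; the grid dimensions
    // must still be divisible by the cluster dimensions.
    __global__ void __cluster_dims__(2, 1, 1) k_compileTimeCluster(int* data)
    {
        // ... kernel body ...
    }

    // Launched with a plain triple-chevron launch, no launch attributes needed:
    // k_compileTimeCluster<<<dim3(8, 1, 1), dim3(256, 1, 1)>>>(d_data);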


Thank you. The breakpoint error was from a double-freed pointer, and the misconfiguration error was from not using cudaSetDevice(num) to select the proper GPU that supports clusters. So now it works without problems, but only on the 5070. The 4070 is CC 8.9, missing the target by 0.1 of a version.
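For anyone hitting the same errors, the device-selection part of the fix was essentially this sketch:

    // Select the first device that reports cluster launch support before
    // building any launch configuration (this was the missing step here):
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.clusterLaunch)
        {
            cudaSetDevice(i);
            break;
        }
    }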
