Cluster size limitation

harlan.zhang · January 23, 2024, 2:23am

According to the CUDA C++ Programming Guide, the maximum cluster size is limited to 8 on some architectures and 16 on H100.

But is the cluster size restricted to the values 1, 2, 4, 8, and 16?

If the cluster size is set to 9, will cluster_per_gpc be 2 and cta_per_gpc be 18, given sm_per_gpc = 18 for every GPC?

Thanks.

Robert_Crovella · January 23, 2024, 12:36pm

~~The only arch at the moment that is relevant is 9.0 (Hopper/H100). The max cluster size for that is 8.~~

~~In the future, if other architectures present other options, you can query that as indicated.~~

For good, portable code, you should query the property indicated:

> The number of thread blocks in a cluster can be user-defined, and a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that on GPU hardware or MIG configurations which are too small to support 8 multiprocessors the maximum cluster size will be reduced accordingly. Identification of these smaller configurations, as well as of larger configurations supporting a thread block cluster size beyond 8, is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.
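A minimal sketch of that query, assuming a CUDA 12.x toolkit, a device that supports thread block clusters, and compilation with `-arch=sm_90`. The kernel `probe_kernel` and the grid and block sizes are hypothetical placeholders. The result depends on the kernel, the launch configuration, and the device, so no expected value is shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only as the target of the query.
__global__ void probe_kernel(float *data) {}

int main()
{
    // Opt in to non-portable cluster sizes (larger than 8) so that the
    // query is allowed to report them where the hardware supports them.
    cudaError_t err = cudaFuncSetAttribute(
        (void *)probe_kernel, cudaFuncAttributeNonPortableClusterSizeAllowed, 1);
    if (err != cudaSuccess) {
        printf("cudaFuncSetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Launch configuration the query is evaluated against.
    // Grid and block sizes here are arbitrary assumptions.
    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(144, 1, 1);
    config.blockDim = dim3(256, 1, 1);
    config.dynamicSmemBytes = 0;

    int maxClusterSize = 0;
    err = cudaOccupancyMaxPotentialClusterSize(
        &maxClusterSize, (void *)probe_kernel, &config);
    if (err != cudaSuccess) {
        printf("cudaOccupancyMaxPotentialClusterSize: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("max potential cluster size: %d\n", maxClusterSize);
    return 0;
}
```

Without the `cudaFuncAttributeNonPortableClusterSizeAllowed` opt-in, the reported value should not exceed the portable limit of 8 described in the quoted passage.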

harlan.zhang · January 24, 2024, 1:34am

@Robert_Crovella

On the Hopper architecture, CUTLASS allows a 4x4 cluster size to be set.

harlan.zhang · January 24, 2024, 1:35am

Is the cluster size limited to 1, 2, 4, 8, or 16?

Robert_Crovella · February 1, 2024, 8:28pm

That doesn’t seem to be the case:

```
$ cat t24.cu
// Kernel definition
// Compile time cluster size 3 in X-dimension and 1 in Y and Z dimension
#include <iostream>
#define N (30*1024)
__global__ void __cluster_dims__(3, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input=NULL, *output=NULL;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
    cudaError_t err = cudaGetLastError();
    std::cout << cudaGetErrorString(err) << std::endl;
}
$ nvcc -o t24 t24.cu -arch=sm_90
$ ./t24
no error
$
```
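The same kind of check can be repeated for other sizes by setting the cluster dimension at launch time instead of with `__cluster_dims__`. The following is a rough sketch rather than part of the test above: `empty_kernel`, the range 1 to 16, and the grid and block sizes are assumptions for illustration. It assumes a CUDA 12.x toolkit and `-arch=sm_90`. Which sizes are accepted depends on the device, so no output is shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel; it has no __cluster_dims__ attribute because
// the cluster size is supplied at launch time below.
__global__ void empty_kernel(float *input, float *output) {}

int main()
{
    float *input = NULL, *output = NULL;

    // Allow cluster sizes above the portable limit of 8 to be attempted.
    cudaFuncSetAttribute((void *)empty_kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    for (unsigned int size = 1; size <= 16; ++size) {
        cudaLaunchConfig_t config = {};
        // The grid dimension must be a multiple of the cluster size.
        config.gridDim = dim3(size * 8, 1, 1);
        config.blockDim = dim3(64, 1, 1);

        // Request a cluster of `size` blocks along X.
        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = size;
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaError_t err = cudaLaunchKernelEx(&config, empty_kernel, input, output);
        if (err == cudaSuccess) {
            err = cudaDeviceSynchronize();  // also catch errors reported asynchronously
        }
        printf("cluster size %2u: %s\n", size, cudaGetErrorString(err));
        cudaGetLastError();  // reset the error state before the next attempt
    }
    return 0;
}
```

Each line of output reports whether that cluster size was accepted for this particular kernel and launch configuration, which is a narrower statement than what the hardware supports in general.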
