The CUDA C++ Programming Guide says the maximum cluster size is 8 on some architectures and 16 on H100.
But is the cluster size restricted to the values 1, 2, 4, 8, 16?
If the cluster size is set to 9, will cluster_per_gpc be 2 and cta_per_gpc be 18, assuming every GPC has sm_per_gpc = 18?
Thanks.
The only architecture that is relevant at the moment is compute capability 9.0 (Hopper/H100). The maximum portable cluster size for it is 8.
If future architectures offer other options, you can query them in the same way.
For good, portable code, you should query the limit as the Programming Guide indicates:
The number of thread blocks in a cluster can be user-defined, and a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that on GPU hardware or MIG configurations which are too small to support 8 multiprocessors the maximum cluster size will be reduced accordingly. Identification of these smaller configurations, as well as of larger configurations supporting a thread block cluster size beyond 8, is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.
@Robert_Crovella
On the Hopper architecture, CUTLASS allows a 4x4 cluster size to be set.
Is the cluster size limited to 1, 2, 4, 8, 16?
That doesn’t seem to be the case:
$ cat t24.cu
// Kernel definition
// Compile time cluster size 3 in X-dimension and 1 in Y and Z dimension
#include <iostream>
#define N (30*1024)
__global__ void __cluster_dims__(3, 1, 1) cluster_kernel(float *input, float* output)
{
}
int main()
{
    float *input=NULL, *output=NULL;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
    cudaError_t err = cudaGetLastError();
    std::cout << cudaGetErrorString(err) << std::endl;
}
$ nvcc -o t24 t24.cu -arch=sm_90
$ ./t24
no error
$
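The two constraints the test above relies on (the grid dimension must be a multiple of the cluster size, and 8 blocks per cluster is the portable limit quoted from the Programming Guide) can be sketched as host-only checks. This is a minimal illustration, not part of the CUDA API: Dim3, kPortableMaxClusterSize and the helper functions are hypothetical names, and Dim3 is a plain stand-in for CUDA's dim3 so the snippet compiles without the toolkit. It does not model architecture-specific limits above 8, which the guide says must be queried at run time.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 (assumption: not the real type).
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Portable maximum number of thread blocks per cluster, as quoted above.
constexpr unsigned kPortableMaxClusterSize = 8;

// Total number of thread blocks in one cluster.
constexpr unsigned clusterBlockCount(Dim3 c) { return c.x * c.y * c.z; }

// True if each grid dimension is a whole multiple of the cluster dimension.
constexpr bool gridIsMultipleOfCluster(Dim3 grid, Dim3 cluster) {
    return cluster.x != 0 && cluster.y != 0 && cluster.z != 0 &&
           grid.x % cluster.x == 0 &&
           grid.y % cluster.y == 0 &&
           grid.z % cluster.z == 0;
}

// True if the cluster holds between 1 and 8 thread blocks.
constexpr bool withinPortableLimit(Dim3 cluster) {
    return clusterBlockCount(cluster) >= 1 &&
           clusterBlockCount(cluster) <= kPortableMaxClusterSize;
}

// Debug-time guard a caller could run before a cluster launch.
inline void requirePortableClusterLaunch(Dim3 grid, Dim3 cluster) {
    assert(gridIsMultipleOfCluster(grid, cluster));
    assert(withinPortableLimit(cluster));
}

// The t24.cu case: N = 30*1024 with 16x16 threads gives a 1920 x 1920 grid,
// and 1920 is divisible by the cluster X-dimension of 3.
static_assert(gridIsMultipleOfCluster(Dim3(1920, 1920), Dim3(3, 1, 1)),
              "grid must be a multiple of the cluster size");
static_assert(withinPortableLimit(Dim3(3, 1, 1)),
              "3 blocks per cluster is within the portable limit");
```

With the same 1920 x 1920 grid, a cluster of (9, 1, 1) fails both checks: 1920 is not divisible by 9, and 9 blocks exceed the portable limit. A 4x4 cluster passes the divisibility check but not the portable-limit check, so whether it launches depends on the architecture-specific query mentioned above.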