Suppose I want to experiment with cluster dimensions when launching a kernel: map some SM units to specific regions of data and accelerate memory operations such as global atomic increments on the same items within a group of blocks (and likewise within the other groups), expecting the CUDA hardware to somehow reduce/combine atomic requests on the same element between SM units in the same cluster. Is this supported only on the H100 GPU?
On a Colab A100 GPU: "a kernel launch error has occurred due to cluster misconfiguration" is returned when using a cluster config.
On a local computer with a 4070 GPU: the same error, even with toolkit version 12.8.
On a 5070 GPU: "A breakpoint instruction (__debugbreak() statement or a similar call) was executed" is returned when trying a cluster config. This looks like support for the 5070 is not yet complete (and the cluster attribute in the kernel launch configuration possibly won’t work).
Another question about clusters: does an atomic add operation really get accelerated if all the SM units making the access are in the same GPC or the same cluster? Is there similar acceleration when launching a cooperative kernel with shared memory shared between blocks? About texture memory: does a GPC or a cluster also cache textures at another level of caching? Or is the cluster feature mainly for writing data and making it visible to other blocks?
int device;
cudaGetDevice(&device); // Get the currently active device
cudaDeviceProp prop2;
cudaGetDeviceProperties(&prop2, device); // Get properties of the device
printf(" %s ", prop2.name);
printf(" CC %i %i", prop2.major, prop2.minor);
The output is:
NVIDIA GeForce RTX 5070 CC 12 0
So are some features of CC 9.0 exclusive to 9.0, so that even CC 12.0 will not support them? Or is it about this being a desktop GPU, unlike the H100?
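Rather than inferring support from the compute-capability number, a run-time check could answer this directly. A minimal sketch, assuming the `cudaDevAttrClusterLaunch` device attribute from the CUDA 12 runtime:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);  // currently active device

    int clusterLaunch = 0;
    // cudaDevAttrClusterLaunch reports 1 if the device can launch
    // thread block clusters, 0 otherwise.
    cudaDeviceGetAttribute(&clusterLaunch, cudaDevAttrClusterLaunch, device);

    printf("cluster launch supported: %d\n", clusterLaunch);
    return 0;
}
```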
I’m using compiler flags like these: compute_89,sm_89;compute_90,sm_90.
At least, is there a way to emulate/simulate clusters so that I can gain some experience with them on the Colab A100 or a desktop 5070? Even if it runs slower under emulation I don’t care; I just want some experience without purchasing an H100.
Thread block clusters are supported on CC 9.0 and higher. From here:
Compute capability 9.0 and above allows users to specify compile time thread block cluster dimensions, so that the kernel can use the cluster hierarchy in CUDA.
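The compile-time route that quote refers to uses the `__cluster_dims__` kernel attribute. A minimal sketch (the kernel name, sizes, and buffer are placeholders, not from the original post):

```cuda
#include <cuda_runtime.h>

// Cluster of 2 blocks in x, fixed at compile time. The grid's
// x-dimension must be a multiple of the cluster's x-dimension.
__global__ void __cluster_dims__(2, 1, 1) myKernel(int *data)
{
    // cluster-aware code goes here
}

int main() {
    int *d = nullptr;
    cudaMalloc(&d, 64 * sizeof(int));
    myKernel<<<4, 128>>>(d);   // 4 blocks -> 2 clusters of 2 blocks each
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```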
So my error was using a non-const variable to set the cluster dimensions before launching the kernel. Then, if I have to support multiple configurations, do I have to prepare for all possible scenarios with constexpr values?
gpuAttributes[0].val is being initialized at run time. But when I move the code into a constexpr function, the compiler says a constexpr function cannot return the non-literal type cudaLaunchConfig_t. I guess it holds some pointers internally, and pointer values are not known at compile time.
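For what it's worth, the run-time route should not require constexpr at all: a `cudaLaunchConfig_t` can be filled with ordinary run-time values and passed to `cudaLaunchKernelEx`. A sketch, assuming a placeholder kernel `myKernel` taking an `int*`:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *data) { /* cluster-aware code */ }

int main() {
    int *d = nullptr;
    cudaMalloc(&d, 64 * sizeof(int));

    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(4, 1, 1);
    config.blockDim = dim3(128, 1, 1);

    // The cluster shape is an ordinary run-time value here,
    // attached as a launch attribute.
    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributeClusterDimension;
    attrs[0].val.clusterDim.x = 2;   // 2 blocks per cluster
    attrs[0].val.clusterDim.y = 1;
    attrs[0].val.clusterDim.z = 1;
    config.attrs    = attrs;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, myKernel, d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```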
Thank you. The breakpoint error was from a double-freed pointer, and the misconfiguration error was from not using cudaSetDevice(num) to select the proper GPU that supports it. So now it works without problems, but only on the 5070. The 4070 is CC 8.9, missing the target by 0.1.