I have written a simple CUDA program to perform array reduction using thread block clusters and distributed shared memory. I am compiling it with CUDA 12.0 and running on a Hopper GPU. Below is the code I use:
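(The original listing is not reproduced in this copy of the thread; the sketch below is a reconstruction based on the identifiers discussed in the replies — shared_mem, cluster_sum, cluster.map_shared_rank — and on the n = 128 array of ones described later. The block and cluster sizes are assumptions, and this is not necessarily the exact code. It is compiled with -arch=sm_90.)

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

#define BLOCK_SIZE   32   // threads per block (assumed)
#define CLUSTER_SIZE 4    // blocks per cluster (assumed)

__global__ void __cluster_dims__(CLUSTER_SIZE, 1, 1)
clusterReduce(const float *input, float *sum, int n)
{
    __shared__ float shared_mem[BLOCK_SIZE];
    __shared__ float cluster_sum;   // block 0's copy serves as the per-cluster accumulator

    cg::cluster_group cluster = cg::this_cluster();
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (tid == 0) cluster_sum = 0.0f;
    shared_mem[tid] = (gid < n) ? input[gid] : 0.0f;
    cluster.sync();

    // Tree reduction of this block's values into shared_mem[0]
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) shared_mem[tid] += shared_mem[tid + stride];
        __syncthreads();
    }

    // Each block adds its partial sum into block 0's cluster_sum
    // through distributed shared memory.
    if (tid == 0)
        atomicAdd(cluster.map_shared_rank(&cluster_sum, 0), shared_mem[0]);
    cluster.sync();

    // One thread per cluster folds the cluster total into the global result.
    if (cluster.block_rank() == 0 && tid == 0)
        atomicAdd(sum, cluster_sum);
}

int main()
{
    const int n = 128;
    float h_in[n], h_sum = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_sum;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, &h_sum, sizeof(float), cudaMemcpyHostToDevice);

    clusterReduce<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_sum, n);
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);

    float cpu_sum = 0.0f;
    for (int i = 0; i < n; ++i) cpu_sum += h_in[i];
    printf("CPU sum = %f, GPU sum = %f\n", cpu_sum, h_sum);

    cudaFree(d_in);
    cudaFree(d_sum);
    return 0;
}
```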
The outputs of the kernel and the CPU do not match, and as far as I can tell there don't seem to be any bugs in my code. Please help me find the issue with this code. Thanks!
What do you mean by “outputs do not match”? What are the actual values for CPU and GPU?
Floating point math is not associative. Results may not be identical between a parallel version and a serial version.
Does it work when you use integers instead of floats?
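For example, a parallel reduction combines partial sums in a different order than a serial loop, and with floats the order can change the result. A small illustration (not from the thread):

```cuda
#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%.1f\n", (a + b) + c);  // prints 1.0
    printf("%.1f\n", a + (b + c));  // prints 0.0: (b + c) rounds back to -1e8 in float
    return 0;
}
```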
Hi, thanks for your reply. I am creating an array of ones of size n = 128 and computing the sum of this array on both the CPU and the GPU. On the CPU side I get the correct sum (sum = 128), but the result from the GPU is much less than that: the GPU output is 64. I don't think such a big difference can come from floating-point error, since the actual computation is just a simple sum. Let me know if this answers your questions.
Thanks for the suggestion. I added a print statement right before we add the per-block sum (shared_mem[0]) to the cluster_sum. The sum per block is as expected, but when I also print the final cluster sum right before we add it to the total sum, it is less than expected. So I guess something goes wrong when we accumulate the cluster sum from the local block sums. It could be either a synchronisation issue (cluster.sync()) or an issue with cluster.map_shared_rank.
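Roughly, the instrumented part of the kernel now looks like this (a fragment of the sketch in the first post; the exact placement of the prints is an assumption):

```cuda
    // Per-block partial sum: prints the expected value (32.0 per block under the assumed block size).
    if (tid == 0) {
        printf("block %u: shared_mem[0] = %f\n", cluster.block_rank(), shared_mem[0]);
        atomicAdd(cluster.map_shared_rank(&cluster_sum, 0), shared_mem[0]);
    }
    cluster.sync();

    // Cluster total: prints less than expected on CUDA 12.0.
    if (cluster.block_rank() == 0 && tid == 0) {
        printf("cluster_sum = %f\n", cluster_sum);
        atomicAdd(sum, cluster_sum);
    }
```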
I also added a cudaMemset to initialise the sum variable to 0, but it doesn't change the code's output. Let me know if you have any further ideas.
Any update? I have reproduced the issue. I added a printf right after
“atomicAdd(cluster.map_shared_rank(&cluster_sum, 0), shared_mem[0]);”
and found that the value accumulated by the atomicAdd on cluster.map_shared_rank(&cluster_sum, 0) is not as expected.
Please retest with a proper install of CUDA 12.3 (or later) on a Hopper H100. There was evidently an issue/bug/defect with atomics on distributed shared memory for the float and double types prior to CUDA 12.3. As an alternative, for verification of the above code on CUDA 12.0, the code will behave correctly if the data type is changed from float to e.g. int.
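For reference, here is an integer variant of the kernel sketched in the first post (same assumed launch parameters; only the reduction type changes):

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

#define BLOCK_SIZE   32
#define CLUSTER_SIZE 4

__global__ void __cluster_dims__(CLUSTER_SIZE, 1, 1)
clusterReduceInt(const int *input, int *sum, int n)
{
    __shared__ int shared_mem[BLOCK_SIZE];
    __shared__ int cluster_sum;

    cg::cluster_group cluster = cg::this_cluster();
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (tid == 0) cluster_sum = 0;
    shared_mem[tid] = (gid < n) ? input[gid] : 0;
    cluster.sync();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) shared_mem[tid] += shared_mem[tid + stride];
        __syncthreads();
    }

    // int atomicAdd on the remote shared-memory address is not affected
    // by the pre-12.3 issue, so this verifies the rest of the code.
    if (tid == 0)
        atomicAdd(cluster.map_shared_rank(&cluster_sum, 0), shared_mem[0]);
    cluster.sync();

    if (cluster.block_rank() == 0 && tid == 0)
        atomicAdd(sum, cluster_sum);
}
```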