Originally published at: https://developer.nvidia.com/blog/efficient-cuda-debugging-memory-initialization-and-thread-synchronization-with-nvidia-compute-sanitizer/
NVIDIA Compute Sanitizer (NCS) is a powerful tool that can save you time and effort while improving the reliability and performance of your CUDA applications. In our previous post, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, we explored efficient debugging in the realm of parallel programming. We discussed how debugging code…
Hi!
I have a question related to the last example (Synchronization checking).
After a few minor fixes to the provided code (adding #include <stdio.h> and correcting threadId → threadID, __syncThreads → __syncthreads, and sumVaules → sumValues), it compiles and prints the correct sum (120), although according to the post it should print 0.
$nvcc -lineinfo -gencode arch=compute_89,code=compute_89 ballot_example.cu -o ballot_example
$./ballot_example
Sum out = 120
But when I run it under compute-sanitizer, it does not:
$compute-sanitizer --tool synccheck --show-backtrace no ./ballot_example
========= COMPUTE-SANITIZER
========= Barrier error detected. Invalid arguments.
========= at __syncwarp(unsigned int)+0xb0 in /usr/local/cuda/targets/x86_64-linux/include/sm_30_intrinsics.hpp:110
========= by thread (0,0,0) in block (0,0,0)
...
========= Barrier error detected. Invalid arguments.
========= at __syncwarp(unsigned int)+0xb0 in /usr/local/cuda/targets/x86_64-linux/include/sm_30_intrinsics.hpp:110
========= by thread (16,0,0) in block (0,0,0)
=========
Sum out = 0
========= ERROR SUMMARY: 17 errors
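For comparison, here is a minimal sketch of the warp-synchronization pattern these synccheck errors point to, written so that it is well defined. It is a reconstruction for illustration, not the post's code: the names sumValuesFixed, runSumValuesFixed, kThreads and kActive, and the single-warp launch, are assumptions chosen to match the output above (a sum of 120 over threads 0 to 15). The idea is that the branch guarding __syncwarp(mask) uses the same predicate as the __ballot_sync that builds the mask, so every thread that calls __syncwarp is a member of the mask. If, as the 17 reported errors (threads 0 through 16) suggest, the original branch admits one more thread than the mask contains, that extra thread calls __syncwarp with a mask that excludes it, which the CUDA Programming Guide describes as undefined behavior.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes (not from the post): one warp, lanes 0..15 do the work.
constexpr int kThreads = 32;
constexpr int kActive  = kThreads / 2;

__global__ void sumValuesFixed(int *sumOut)
{
    __shared__ int smemBuffer[kActive];
    int threadID = threadIdx.x;

    // All 32 lanes are still converged here, so the full member mask
    // 0xffffffff passed to __ballot_sync is valid.
    bool active = threadID < kActive;
    unsigned int mask = __ballot_sync(0xffffffffu, active);

    // Same predicate as the ballot: every thread that reaches
    // __syncwarp(mask) is guaranteed to be a member of mask.
    if (active) {
        smemBuffer[threadID] = threadID;
        __syncwarp(mask);  // orders the shared-memory writes before lane 0 reads them
        if (threadID == 0) {
            int sum = 0;
            for (int i = 0; i < kActive; ++i) sum += smemBuffer[i];
            *sumOut = sum;
        }
    }
}

// Host helper (hypothetical): launches one warp and returns the sum.
int runSumValuesFixed()
{
    int *dSum = nullptr;
    cudaError_t err = cudaMalloc(&dSum, sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMemset(dSum, 0, sizeof(int));
    assert(err == cudaSuccess);

    sumValuesFixed<<<1, kThreads>>>(dSum);
    err = cudaGetLastError();
    assert(err == cudaSuccess);

    int hSum = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(&hSum, dSum, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dSum);
    return hSum;  // 0 + 1 + ... + 15 = 120
}
```

Calling runSumValuesFixed() from main and printing the result should give 120, and because every caller of __syncwarp is in its mask, this version should not trigger the "Invalid arguments" barrier errors under compute-sanitizer --tool synccheck.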
My setup: NVIDIA driver 560.35.03 (nvidia-smi 560.35.03), CUDA 12.6, running on an NVIDIA GeForce RTX 4060.
1) Is there a particular reason for this behavior? I was expecting 0, as the post says.
The documentation says the result is undefined, but I get the correct result (even with the <=).
I know this post is meant to be educational rather than performance-oriented, but:
2) What are the advantages and disadvantages of declaring shared memory outside the kernel scope, and in what scenarios should this approach be used?
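To make question 2 concrete, the sketch below shows three ways a kernel can obtain shared memory: a file-scope __shared__ array, a kernel-scope __shared__ array, and dynamically sized extern __shared__ memory. All names (gTile, dynTile, tileSum, tileSumFile, runSharedScopes) and the block size kBlock = 128 are hypothetical; the sketch only illustrates the declarations and does not recommend one over the others.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // hypothetical block size

// (a) File-scope static shared memory: each block of a kernel that uses it
// gets its own copy; the size is fixed at compile time and no initializer
// is allowed.
__shared__ int gTile[kBlock];

// Device helpers can touch gTile directly, without a pointer parameter.
__device__ int tileSumFile()
{
    int s = 0;
    for (int i = 0; i < kBlock; ++i) s += gTile[i];
    return s;
}

__global__ void fileScopeKernel(int *out)
{
    gTile[threadIdx.x] = 1;
    __syncthreads();
    if (threadIdx.x == 0) out[0] = tileSumFile();
}

// (b) Kernel-scope static shared memory: same per-block storage, but the
// name is visible only inside this kernel, so helpers receive a pointer.
__device__ int tileSum(const int *tile, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i) s += tile[i];
    return s;
}

__global__ void kernelScopeKernel(int *out)
{
    __shared__ int tile[kBlock];
    tile[threadIdx.x] = 1;
    __syncthreads();
    if (threadIdx.x == 0) out[1] = tileSum(tile, kBlock);
}

// (c) Dynamic shared memory: unsized extern array whose byte count is the
// third launch-configuration parameter.
extern __shared__ int dynTile[];

__global__ void dynamicKernel(int *out, int n)
{
    dynTile[threadIdx.x] = 1;
    __syncthreads();
    if (threadIdx.x == 0) out[2] = tileSum(dynTile, n);
}

// Runs all three variants with one block of kBlock threads.
void runSharedScopes(int result[3])
{
    int *d = nullptr;
    cudaError_t err = cudaMalloc(&d, 3 * sizeof(int));
    assert(err == cudaSuccess);
    fileScopeKernel<<<1, kBlock>>>(d);
    kernelScopeKernel<<<1, kBlock>>>(d);
    dynamicKernel<<<1, kBlock, kBlock * sizeof(int)>>>(d, kBlock);
    err = cudaMemcpy(result, d, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d);
}
```

Calling runSharedScopes from main should fill all three entries with 128. As far as the trade-offs go, the file-scope form lets device helpers such as tileSumFile reach the buffer without a pointer parameter, but the name is visible to every kernel in the translation unit, the kernel's shared-memory footprint is no longer apparent from its own body, and the size is fixed at compile time. The kernel-scope form keeps that footprint local and explicit, and the dynamic form defers the size to launch time at the cost of passing the byte count in the launch configuration.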
Thanks again for your time and great posts.