CUDA Debugging


I am working on a kernel that finds the minimum of each 3x3 sub-array of an array.

The problem is that I am having a hard time debugging the kernel, which is actually not long. It has begun to feel like working on a 1970s computer, where you have to review your code manually over and over again, desperately looking for the possible cause. I have calculated the index numbers by hand about 100 times for every grid, along with the indexes for every shared memory transfer… and I still end up with a meaningless result file.

So I am pretty sure this is not how the pros do it… Can you help me with a methodology for debugging on GPUs…

Thanks in advance…

Make sure to check the status of all CUDA API calls and of all kernel launches. I would suggest first running the code under control of cuda-memcheck, which can point out out-of-bounds memory accesses, race conditions, and failed API calls.

A 3x3 array seems to imply that there are only a few threads in flight, or at least that you can easily configure your kernel invocation to use only a few threads. If so, you should be able to debug easily with the help of a few printf() calls in the code. Beyond that, CUDA comes with a powerful debugger, which allows you to do all the things developers normally do with debuggers, such as single-stepping code and inspecting the values of variables.

The original array size is 4182x4182. I defined 17x17 threads per block. Each block transfers a 19x19 sub-array to shared memory, then synchronizes its threads, searches the 3x3 sub-arrays, and writes to the result array.
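One way to double-check index arithmetic like this without a GPU is to emulate the load pattern on the host. The sketch below is not the kernel from this thread. It assumes a hypothetical cooperative load in which each of the 17x17 = 289 threads copies tile cell `t`, then `t + 289`, and so on, and it counts how often each of the 19x19 = 361 shared-memory cells gets written. The names `tile_write_counts` and `unwritten_cells` are made up for illustration.

```cpp
#include <vector>

constexpr int BLOCK = 17;         // threads per block edge (from the post)
constexpr int TILE  = BLOCK + 2;  // 19: block edge plus a one-element halo per side

// Host-side emulation of an ASSUMED load scheme: thread t writes tile cells
// t, t + 289, t + 2*289, ... for at most loadsPerThread iterations.
// Returns, for every tile cell, how many times it was written.
std::vector<int> tile_write_counts(int loadsPerThread) {
    std::vector<int> counts(TILE * TILE, 0);
    const int threads = BLOCK * BLOCK;
    for (int t = 0; t < threads; ++t) {
        for (int k = 0; k < loadsPerThread; ++k) {
            const int cell = t + k * threads;  // linear index into the tile
            if (cell < TILE * TILE) {
                ++counts[cell];
            }
        }
    }
    return counts;
}

// Number of tile cells that no thread wrote; these would hold garbage
// in shared memory when the 3x3 search reads them.
int unwritten_cells(const std::vector<int>& counts) {
    int missing = 0;
    for (int c : counts) {
        if (c == 0) ++missing;
    }
    return missing;
}
```

With one load per thread, 361 - 289 = 72 tile cells are never written, so `unwritten_cells(tile_write_counts(1))` is 72, while two loads per thread cover the whole tile. A kernel whose results are wrong but which never crashes is consistent with this kind of gap, although that is only a guess without seeing the actual code.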

I mostly use printf, but after a certain point it became exhausting. I have out-of-bounds checks at the beginning of the main part, so the kernel does not crash. But the output has nothing to do with the real results.

I have another kernel which reads everything from global memory. It is significantly faster than the CPU, but not good enough.

The code will be easier to debug if you can scale down the problem size. If there is a fundamental algorithmic error it is likely to show up in the scaled-down version as well. In fact, for new code development (CPU or GPU, doesn’t matter) I almost always start with a tiny matrix size first, and incrementally scale up once that is all working solidly.
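A scaled-down run is most useful when there is a trusted answer to compare it against. The following is a minimal host-side sketch of such a reference, under the assumption that "minimum of 3x3 sub-arrays" means a sliding window over every valid position, so a rows x cols input yields a (rows-2) x (cols-2) output. If the kernel handles borders differently, the reference needs adjusting. The function names `min3x3_reference` and `first_mismatch` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU reference (ASSUMED semantics): out(r, c) is the minimum of the 3x3
// window of the row-major input whose top-left corner is at (r, c).
std::vector<float> min3x3_reference(const std::vector<float>& in,
                                    int rows, int cols) {
    if (rows < 3 || cols < 3) return {};
    const int outRows = rows - 2;
    const int outCols = cols - 2;
    std::vector<float> out(static_cast<std::size_t>(outRows) * outCols);
    for (int r = 0; r < outRows; ++r) {
        for (int c = 0; c < outCols; ++c) {
            float m = in[static_cast<std::size_t>(r) * cols + c];
            for (int dr = 0; dr < 3; ++dr) {
                for (int dc = 0; dc < 3; ++dc) {
                    m = std::min(m, in[static_cast<std::size_t>(r + dr) * cols + (c + dc)]);
                }
            }
            out[static_cast<std::size_t>(r) * outCols + c] = m;
        }
    }
    return out;
}

// Returns the index of the first element that differs, or -1 if the two
// results agree. A size mismatch is reported at the shorter length.
// Exact comparison is fine here because a minimum involves no rounding.
long first_mismatch(const std::vector<float>& expected,
                    const std::vector<float>& actual) {
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) return static_cast<long>(i);
    }
    return expected.size() == actual.size() ? -1L : static_cast<long>(n);
}
```

Filling a tiny input with distinct values (for example 0, 1, 2, … in row-major order) makes the expected output easy to verify by eye, and the index returned by `first_mismatch` can be converted back to a row and column to see which block and thread produced the first wrong value.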

As for using printf(), you would want to minimize what is being printed, and consolidate the output into a log that makes it easy to see what was going on in temporal sequence (for example, by indentation or markup).

Debugging on the GPU isn’t really much different than debugging a multi-threaded program on the CPU, and the same techniques apply. Sometimes, staring at the code as mentioned in your original question also helps, in particular if you do that after not having looked at the code for a few days.


This array is already scaled down. I will have huge arrays and tens of kernels. I don't think software developers test every function with scaled-down inputs (there are, of course, test feeds).

I believe CUDA has a solution for this…