Seeking an Efficient Way to Debug CUDA Kernels

This is an open post; its purpose is to collect efficient ways to debug CUDA kernels.

I started CUDA programming during my PhD, when I developed a solver for the Helmholtz equation. That was the first time I realized how difficult debugging can be. My strategy was to debug the kernel with gridDim 1 and blockDim 1 (a single thread), and to scale up once the single-threaded version was bug-free. That way, I could use printf to print out debug information to be analyzed later.
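For illustration, that single-thread strategy looks roughly like this (the kernel and sizes are made up, not the actual Helmholtz solver):

```cuda
#include <cstdio>

// Toy kernel: a naive running sum over a small array, launched with a single
// thread so the printf trace arrives in a deterministic order.
__global__ void scanNaive(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = (i == 0) ? in[0] : out[i - 1] + in[i];
        printf("i=%d in=%f out=%f\n", i, in[i], out[i]);  // debug trace
    }
}

int main() {
    const int n = 8;
    float h_in[n], *d_in, *d_out;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    scanNaive<<<1, 1>>>(d_in, d_out, n);   // gridDim 1, blockDim 1
    cudaDeviceSynchronize();               // flush device-side printf
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```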

As I went deeper into CUDA programming, I realized that the single-thread approach does not scale: most of the time, kernels are designed at the warp/block/grid level, e.g., using shared memory per block or shuffle operations to perform reductions. In that case, a kernel has to be analyzed at the thread, warp, block, and grid levels, and debugging a full-scale kernel execution becomes inevitable.
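For reference, this is the kind of warp-level code meant here (a standard __shfl_down_sync sum, not any particular kernel from my solver):

```cuda
// Warp-level sum: each of the 32 lanes holds one value; after the loop,
// lane 0 holds the sum over the whole warp. Single-thread debugging cannot
// exercise this code path, because the shuffles only make sense with a full
// warp in flight.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```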

The only debugging tools I know are cuda-memcheck and cuda-gdb. cuda-memcheck is a wonderful tool, since under it an illegal memory access is reported as a runtime error. However, it only provides insight into memory accesses, not into the correctness of the kernel's results. For comprehensive CUDA debugging, cuda-gdb seems to be the only option. To this day, I haven't found an efficient way to use cuda-gdb, due to the complexity of multi-threaded kernel execution. In theory, every step of every thread has to be validated to establish the correctness of a kernel or to find the source of a bug. I am seeking efficient methods of CUDA debugging using a combination of these tools. Any advice?
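For concreteness, a typical session with these tools looks roughly like this (the binary and kernel names are placeholders; on recent toolkits, cuda-memcheck has been superseded by compute-sanitizer):

```shell
# Memory checking (compute-sanitizer replaces cuda-memcheck on recent toolkits)
cuda-memcheck ./myapp
compute-sanitizer ./myapp

# Interactive debugging; build with nvcc -g -G for device-side symbols
cuda-gdb ./myapp
(cuda-gdb) break myKernel            # stop at kernel entry
(cuda-gdb) run
(cuda-gdb) info cuda threads         # see where each thread currently is
(cuda-gdb) cuda block 0 thread 5     # switch focus to a single thread
(cuda-gdb) print localVar            # inspect that thread's state
```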

Personally, I like to write my kernels with strided loops so that I can use just a single “work-group” for debugging. This can be a thread, a warp, a block, whatever the granularity of computation is.
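A grid-stride loop is one common way to get that property: the same kernel is correct for any launch configuration, so it can be launched with one warp (or one thread) for debugging. A hypothetical saxpy-style sketch:

```cuda
// Grid-stride loop: correctness does not depend on the launch configuration,
// so the identical kernel can be run with a single block for debugging and
// with a full grid for production.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
// debug:      saxpy<<<1, 32>>>(n, a, d_x, d_y);   // one warp does all the work
// production: saxpy<<<256, 256>>>(n, a, d_x, d_y);
```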

How could a program “know” what the kernel is intended to do? You need to have tests for your code, maybe a reference CPU implementation. When I see one of my tests failing, I use kernel printf to print values of variables to find a mistake.

If your kernel performs multiple steps, try to extract functions which can be tested separately.

To avoid some bugs, try to use libraries like CUB instead of implementing your own primitives like reductions etc.
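For example, a block-wide sum via cub::BlockReduce instead of a hand-rolled shared-memory tree (a sketch; BLOCK_THREADS must match the launch configuration):

```cuda
#include <cub/cub.cuh>

// Block-wide sum with CUB: the library handles the shared-memory layout,
// synchronization, and architecture-specific tuning.
template <int BLOCK_THREADS>
__global__ void blockSum(const int* in, int* out) {
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int thread_data = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    int sum = BlockReduce(temp_storage).Sum(thread_data);  // valid in thread 0

    if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
// launch: blockSum<128><<<numBlocks, 128>>>(d_in, d_out);
```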


Frame challenge: Maybe it would be more efficient to look at the software development process.

Even with thirty years of professional software development experience (and forty years of programming in total), I still make a lot of mistakes when programming. That is not for lack of trying to avoid making mistakes. I admire people (and I know a few) who can write down 200 lines of code and have it all work flawlessly the first time they compile. Unfortunately, that is not me. And while I am quite good at debugging (to the point that I have succeeded in root-causing problems where others failed), it annoys me to no end.

I use two development strategies to cope with this scenario:

  1. Build strong test scaffolding from the start (not quite “test first”, but close)
  2. Small incremental changes only (with revision control, snapping back to last known good version is trivial)

With a “smoke” test after every incremental change, it is easy to detect breakage before problems start festering. Because the most recent change is still top of mind and small in extent, a bit of deep thought and staring at the code and the failed tests usually makes it fairly readily discernible what went wrong, and one can then correct one’s mental model.

For larger projects, the above is extended with (0.a) a formal design document and (0.b) a formalized test plan. Writing a design document is a great debugging tool: when one tries to describe how something should work at a reasonable level of detail, one finds all kinds of flaws in one’s thinking before a single line of code has been written. As for the test plan, a colleague once observed that I “appear to be going out of my way to arrive at reference results in a manner as different as possible from the actual implementation”. That is a pretty accurate description of my methodology, and it often adds additional insights.

In consequence, I have rarely engaged in “classical” debugging over the past twenty years of professional software development, and the number of occasions on which I have fired up a debugger in that time can probably be counted on the fingers of my hands. 99% of my debugging was done with printf and logging only, including in parallel and distributed software. I find that the key there is to print out as little information as possible, to avoid drowning in a sea of data; this requires careful thinking about which data will be most helpful in pinpointing a problem.


Session 12 of this series may be of interest. I’m not sure it answers any of the questions in this thread, but it does cover a basic walkthrough of cuda-gdb usage along with a few things to look out for in a multithreaded debugging environment.

At the company where I work, we use a combination of the following unit-testing and test-automation frameworks:

Google Test Framework
This framework builds an executable that runs a series of unit tests, with the output nicely summarized on the console. The CUDA-based tests compare a CPU-generated reference solution against the results computed on the GPU.
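A test in that style might look like this (the names and tolerance are made up; it assumes a gpuSaxpy wrapper that launches the kernel and copies the result back to the host):

```cuda
#include <gtest/gtest.h>
#include <vector>

// Hypothetical wrappers around the CPU reference and the CUDA kernel.
void cpuSaxpy(int n, float a, const float* x, float* y);
void gpuSaxpy(int n, float a, const float* x, float* y);  // launches the kernel

TEST(SaxpyTest, MatchesCpuReference) {
    const int n = 1 << 16;
    std::vector<float> x(n, 1.5f), yCpu(n, 2.0f), yGpu(n, 2.0f);

    cpuSaxpy(n, 3.0f, x.data(), yCpu.data());
    gpuSaxpy(n, 3.0f, x.data(), yGpu.data());

    // Element-wise comparison with a small floating-point tolerance.
    for (int i = 0; i < n; ++i)
        ASSERT_NEAR(yCpu[i], yGpu[i], 1e-5f) << "mismatch at index " << i;
}
```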

ctest (a CMake-driven testing framework)
This toolkit is used to run multiple short simulation campaigns whose output has to be identical to stored reference results.
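In CMake terms, registering such a campaign is roughly the following (the target and reference-file names are placeholders):

```cmake
enable_testing()

# 1) Run a short simulation campaign (hypothetical binary and flags).
add_test(NAME run_campaign
         COMMAND sim_app --steps 100 --out output.dat)

# 2) Byte-compare its output against the stored reference result.
add_test(NAME check_campaign
         COMMAND ${CMAKE_COMMAND} -E compare_files
                 output.dat ${CMAKE_SOURCE_DIR}/ref/output_ref.dat)
set_tests_properties(check_campaign PROPERTIES DEPENDS run_campaign)
```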

On every code commit, we run these tests in an automated testing pipeline provided by Jenkins. Any compilation or test errors are flagged and e-mailed to the respective committers for them to fix. This approach enables continuous integration. The drawback is that running the tests requires a lot of computing resources on a large server farm.

And I almost forgot: the Jenkins pipeline also runs the binaries through tools such as valgrind and the CUDA Compute Sanitizer to flag any data races and memory-related issues.
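Concretely, with made-up binary names, those pipeline steps are along the lines of:

```shell
# Host-side memory errors; non-zero exit code makes the CI stage fail
valgrind --error-exitcode=1 ./host_tests

# Device-side illegal accesses
compute-sanitizer --tool memcheck ./gpu_tests

# Shared-memory and global-memory data races
compute-sanitizer --tool racecheck ./gpu_tests
```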

Most of the tools I listed focus on detecting problems, not so much on the debugging itself. But one can only debug what one can detect!