Context
I am integrating a custom module/plugin into an SDK for an asynchronous CUDA application. When I run compute-sanitizer on our code in isolation (our own examples, environment, etc., independent of the application’s code), neither memcheck nor racecheck reports any issues. Our code also runs without errors in the SDK’s build environment using the SDK’s dependencies, i.e., inside their Docker container.
When I run the integrated module inside the SDK’s asynchronous CUDA application under compute-sanitizer, the program typically throws a non-deterministic memory-related error. The worst part is that sometimes the code even works. My hypothesis is that some source of overhead perturbs the timing, either letting the racing operations settle in the correct order or preventing them from executing in the correct order.
I’m not sure what information I should be printing, but when I check for memory-related issues with printf tactics, everything looks fine whenever the program gets past that step.
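For context, my printf tactics amount to wrapping each CUDA call in an error check, roughly like the sketch below (CHECK_CUDA is an illustrative name I’m using here, not SDK code, and the abort-on-error policy is my own choice):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro: print the failing call's location and error, then abort.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n", \
                         cudaGetErrorName(err_), __FILE__, __LINE__, \
                         cudaGetErrorString(err_)); \
            std::abort(); \
        } \
    } while (0)

// Usage:
// CHECK_CUDA(cudaMalloc(&dev_ptr, nbytes));
// CHECK_CUDA(cudaDeviceSynchronize());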
Below are the two commands I’m using.
path_to_compute_sanitizer/compute-sanitizer \
--log-file sanitizer_output.txt \
--leak-check full \
--tool memcheck \
/program

path_to_compute_sanitizer/compute-sanitizer \
--log-file racecheck_output.txt \
--tool racecheck \
/program
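In addition to memcheck and racecheck, compute-sanitizer also provides initcheck (uninitialized device memory accesses) and synccheck (invalid use of synchronization primitives) sub-tools; the equivalent invocations, with the same placeholder paths, would be:

path_to_compute_sanitizer/compute-sanitizer \
--log-file initcheck_output.txt \
--tool initcheck \
/program

path_to_compute_sanitizer/compute-sanitizer \
--log-file synccheck_output.txt \
--tool synccheck \
/program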
I am a relatively new CUDA developer, so the various debugging tools and paradigms are still new to me. I’ve been reading lots of blogs/posts, such as:
- c - Best way to print information when debugging a race condition - Stack Overflow
- Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer | NVIDIA Technical Blog
- Efficient CUDA Debugging: Using NVIDIA Compute Sanitizer with NVIDIA Tools Extension and Creating Custom Tools | NVIDIA Technical Blog
Any advice, or pointers to material I should read, would be greatly appreciated.
System
Linux Ubuntu
NVIDIA driver 535
Docker environment
CUDA 12.1
GCC/G++ 7.5
CMake 3.18
Issue
When running under a standard debugger like GDB, the integrated code fails. The error varies, so it’s hard to say exactly what the issue is, but I know it’s memory related and likely also race-condition related.
The error occurs in different places and also changes, but it most often occurs on the first CUDA call, which can be anything, e.g., cudaDeviceSynchronize(), cudaMalloc(), or cudaMemGetInfo(). However, the error is almost always memory related (see the Errors section).
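Because the failure point moves around, one knob for controlling launch asynchrony is the standard CUDA_LAUNCH_BLOCKING environment variable, which forces kernel launches to run synchronously so a failure surfaces closer to its cause (a sketch, assuming a bash-style shell and the same placeholder program path):

CUDA_LAUNCH_BLOCKING=1 /program

CUDA_LAUNCH_BLOCKING=1 path_to_compute_sanitizer/compute-sanitizer --tool memcheck /program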
Errors
Using GDB
a) corrupted double-linked list
b) malloc(): corrupted top size
c) malloc(): invalid size (unsorted)
d) Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Using compute-sanitizer
When running under compute-sanitizer, the application typically doesn’t exit gracefully, which prevents compute-sanitizer from logging/reporting correctly:
...
========= Error: process didn't terminate successfully
========= Target application returned an error
...
Attempts
The scariest result is that I’ve tried to write some really minimal examples to reproduce the error, but in the smaller examples the error doesn’t occur (at least in the number of times I ran them). I plan to look into this more to make sure the smaller examples are set up properly.
Initially I went through some code refactors, undoing some C-style optimizations and moving toward more idiomatic C++. However, this didn’t seem to change the behavior of the bug at all.
I’ve also tried to narrow down the cause by simplifying what happens in the code. However, as mentioned, the program typically fails on the first CUDA API call and then doesn’t exit gracefully, so it’s really hard to get more information.
Current Goal
Currently I am looking into ways to probe the state of the application, to see if I can gather any information on the race condition that might be occurring before the crash (I don’t even know what the exact cause is yet). Being able to determine the source of the bug would be extremely useful.
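As a concrete example of the kind of probing I have in mind, a small checkpoint helper like the sketch below could be called between stages (cudaPeekAtLastError, cudaGetErrorString, and cudaMemGetInfo are standard runtime API calls; the helper name and where to call it are my own):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative checkpoint: report the sticky CUDA error state and device memory usage.
void probe_state(const char* where) {
    // cudaPeekAtLastError() reads the error state without clearing it.
    cudaError_t err = cudaPeekAtLastError();
    std::printf("[%s] last CUDA error: %s\n", where, cudaGetErrorString(err));

    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
        std::printf("[%s] device memory: %zu free / %zu total bytes\n",
                    where, free_bytes, total_bytes);
    }
    std::fflush(stdout); // make sure output survives a subsequent crash
}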
Questions
Working with compute-sanitizer and how to deal with application crashes
Are there other tools besides compute-sanitizer, or is compute-sanitizer the most suitable tool for debugging memory-related issues in a CUDA module inside an asynchronous CUDA application? How can I address the fact that when the application crashes/force-quits/freezes, compute-sanitizer doesn’t finish whatever it’s trying to do?
General tips on debugging CUDA programs
Other than calling it a “race condition”, how can I further narrow down the type of bug and what is actually happening?
How can I get more information about what is happening? Is there a way to gather enough information before the crash to visualize what’s going on? Is there a hacky way to check the program’s state before it crashes, and is just printing out the related information enough? I don’t know what information I should be printing, because I suspect the race condition means the bug is not happening where the error is thrown. I also don’t have access to the application’s source code, only to the SDK.