cuda-memcheck hangs the whole system

Every now and then, when I troubleshoot “illegal memory accesses” with cuda-memcheck, it literally freezes the entire system to the point that the mouse pointer doesn’t move and the caps/…-lock toggles don’t respond on the keyboard. The only thing I can do is press the reset button on the computer.

It doesn’t happen after every build, but running cuda-memcheck is a gamble.

It is especially annoying because loading projects into Visual Studio takes forever after a reset.

cuda-memcheck can substantially slow down CUDA execution (in some cases by a factor of 10x or more), so kernels take considerably longer to run under cuda-memcheck than they do normally.

A long-running kernel will noticeably impair the responsiveness of the UI if it is running on a GPU in WDDM mode that is servicing the UI. In WDDM mode, a GPU running a CUDA kernel is not servicing the display/UI while the kernel is running.

Finally, a long-running kernel can trip the TDR limit on a GPU running in WDDM mode.

Switch to running your CUDA kernels on a GPU that is in TCC mode.

You can also try adding another GPU that is not in TCC mode and running your kernels there. As long as your display/desktop is not extended to that GPU, I believe it will not impact the UI. The TDR watchdog is still in effect, however, for any WDDM GPU in the system.
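For illustration, a minimal sketch (not from this thread) of how one might enumerate the GPUs in the system and direct kernel launches to a non-display device; the device index 1 is purely an assumption:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.tccDriver is 1 when the device is running the TCC driver model
        printf("device %d: %s, TCC=%d\n", i, prop.name, prop.tccDriver);
    }
    cudaSetDevice(1);   // hypothetical: select the GPU that is not driving the display
    // ... subsequent allocations and kernel launches go to that device ...
    return 0;
}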

Ok, but why does it freeze other hardware? Even the caps-lock key becomes unresponsive, and the mouse acceleration switch doesn’t change the mouse speed indicator. Sometimes the hard-drive light on the computer gets stuck on with seemingly no hard drive activity.

It just gets stuck like this indefinitely, whereas a “successful” kernel run finishes in a split second.

Come to think of it, I believe I have disabled the TDR in NVIDIA Nsight.

Ok, I have given the cuda-memcheck tool well over an hour and the whole system is completely unresponsive. I don’t think it is supposed to be that way. And I have now checked that TDR is indeed disabled and has been all along. I don’t have access to a TCC-mode GPU.
Debug.rar (1.36 MB)

File a bug? developer.nvidia.com

The ability to work on such a bug will be considerably enhanced by a reliable repro case.

I’ve never witnessed a situation where cuda-memcheck hangs the system, and I use it extensively. However, I make only occasional use of Windows, and I rarely run on Windows with TDR disabled. So when I use cuda-memcheck on Windows it’s more common for me to hit the TDR.

To be clear, the UI is expected to be unresponsive while a kernel is running on a WDDM GPU.

The bug is now submitted; it remains to be seen how that will fare… I think the issues I have are related to thread divergence and calls to __syncthreads().

However, while it is expected to have bugs to weed out during development, I see no reason for the cuda-memcheck tool to halt the whole system indefinitely when the executable on its own terminates with an illegal memory access error within a few seconds.

Note that if your code contains an out-of-bounds memory access, the application behavior may be unpredictable and change with the execution environment, in particular if the out-of-bounds access is a load. Uninitialized data entering into a loop-bounds check could lead to an extremely long-running or even infinite loop, which in conjunction with a disabled watch-dog timer could cause indefinite kernel run-time, stopping the GUI dead in its tracks.
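As a hypothetical illustration of that failure mode (this is made-up code, not the attached project), an out-of-bounds load that feeds a loop bound can look like this:

// Sketch only: 'bounds', 'out' and 'n' are illustrative names.
__global__ void kernel(const int *bounds, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Bug: no check that i < n. For the last block, bounds[i] reads past the
    // end of the allocation and returns garbage.
    int iters = bounds[i];
    float acc = 0.0f;
    // If the garbage value happens to be huge, this loop runs for billions of
    // iterations; with TDR disabled the kernel is never killed, and on a WDDM
    // GPU the UI stays frozen for as long as it runs.
    for (int k = 0; k < iters; ++k)
        acc += 1.0f / (k + 1);
    out[i] = acc;
}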

The contents of out-of-bounds memory are undefined and can well be different between runs with and without cuda-memcheck. In my time programming with CUDA I have run across several cases of intermittent, strange application failures that went away simply by adding a nonsense environment variable (e.g. set foo=bar). As it turned out, the data in the process environment (and the environment variables in particular) made it into the app via an out-of-bounds access, causing this “weird” behavior.

Another possible error scenario is that you are invoking undefined behavior by incorrect use of __syncthreads(), e.g. a __syncthreads() call in divergent code flow. Since undefined behavior is exactly that, it may change depending on execution context, including whether you are running with or without cuda-memcheck. Depending on the nature of your code, it may give rise to a kernel with infinite run time.
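A minimal illustration of that misuse and one way to restructure it (hypothetical code, not from the thread):

// Broken: only some threads of the block reach the barrier; this is
// undefined behavior and can hang the kernel.
__global__ void broken(float *data)
{
    int i = threadIdx.x;
    if (data[i] > 0.0f) {
        __syncthreads();        // reached only by a subset of the block
        data[i] *= 2.0f;
    }
}

// Restructured: every thread of the block executes the same barrier,
// and only the divergent work sits inside the conditional.
__global__ void fixed(float *data)
{
    int i = threadIdx.x;
    bool positive = (data[i] > 0.0f);
    __syncthreads();            // executed unconditionally by all threads
    if (positive)
        data[i] *= 2.0f;
}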

In practical terms, you could try to find the out-of-bounds access by code review, or follow txbob’s advice above.

I have now recompiled and run the binary with all __syncthreads() removed. I also removed accesses to variables that may in some circumstances be touched by other threads, thereby eliminating potential race conditions. I might use atomic operations here in the future, but right now I don’t know how to apply them to arrays.
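For reference, atomics apply to individual array elements just as they do to scalars; a minimal sketch (hypothetical names, assuming the values fit the number of bins):

// Each thread atomically increments one element of the 'bins' array.
__global__ void histogram(const int *values, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1u);   // atomic update of a single array element
}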

But still I get the same freezing behaviour from cuda-memcheck as before. I don’t really know how to troubleshoot the code to be honest. I ported the code to pure host code and only minor modifications were required to make it run just fine on the CPU.

I’ll see if I can get the code platform independent by removing all Windows specific elements and then submit it here.

Simply removing all __syncthreads() from the code sounds like a voodoo debugging technique, similar to waving a rubber chicken over your monitor.

Try reducing the problem size and adding device-side printf() to the code. I have found any number of bugs in kernels just with a simple log produced with printf(). I would suggest starting with just a couple of printf() calls to avoid overwhelming the ring-buffer used by device-side printf() to communicate with the host.

I guess I can go back to using in-device printf() calls; it’s just that with a few thousand threads and many thousands of iterations it might not be that easy to spot the problem. I could limit the output of printf() by enclosing it in an if (threadId == 0) statement, but that may not be the faltering thread. After all, I’m using cuRAND, so the data in one thread may be very different from the data in another, even though the GPU is SIMT.

The removal of __syncthreads() was contemplated after I read this source:

Obviously you would want to limit the use of printf() to maybe just one or at most a few threads, e.g. just have thread index 0 print. It is also possible to increase the size of the buffer used by device-side printf().
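A minimal sketch of both suggestions, limiting output to one thread and enlarging the printf ring buffer (the kernel and sizes are illustrative only):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(const float *data, int n)
{
    // Log from a single thread to keep the output readable.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("block 0, thread 0: n=%d data[0]=%f\n", n, data[0]);
    // ... rest of the kernel ...
}

int main()
{
    // Optional: enlarge the ring buffer used by device-side printf()
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);
    // ... allocate, launch the kernel, cudaDeviceSynchronize(), check errors ...
    return 0;
}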

In addition, it is probably a good idea to reduce the problem size, and thus presumably the grid configuration. Likewise, reduce the iteration count. You might also be able to selectively disable pieces of code with #ifdef and narrow down the location of the problem by bisecting the code in this manner. Or change the random numbers used in your code to well-known fixed numbers during debugging.
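The #ifdef bisection could look roughly like this (a sketch with made-up phase names):

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#ifdef ENABLE_PHASE1
    data[i] += 1.0f;            // stand-in for the first half of the work
#endif
#ifdef ENABLE_PHASE2
    data[i] *= 2.0f;            // stand-in for the second half of the work
#endif
}

// Compile with e.g. nvcc -DENABLE_PHASE1 ... and toggle the defines to
// narrow down which section triggers the illegal access.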

I would focus on the out-of-bounds access for now, as you already know that this problem exists in the code. Out-of-bounds accesses that are not of the off-by-1 type are often the result of operating on huge array indices, caused by mixing signed and unsigned data in index computations, for example.
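A classic instance of that signed/unsigned pitfall, as a hypothetical example (not the poster's code):

__global__ void blur(const float *in, float *out, int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;   // unsigned index
    if (i >= (unsigned int)n) return;
    // Intended: clamp at the left boundary. Bug: 'i - 1' is computed in
    // unsigned arithmetic, so for i == 0 it wraps to 4294967295; the test
    // 'i - 1 >= 0' is always true, and the load reads far outside the array.
    unsigned int left = (i - 1 >= 0) ? (i - 1) : 0;
    // Correct form would be: unsigned int left = (i > 0) ? (i - 1) : 0;
    out[i] = 0.5f * (in[i] + in[left]);
}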

Not sure how you got to this state of affairs. It may boil down to bugs in your code, bugs in the compiler, bugs in cuda-memcheck, or a combination thereof. Given that the CUDA development tools are quite robust these days, my initial hypothesis would be that the source of the trouble is bugs in the code itself.

My recommended software development strategy (not just for CUDA, but in general) is to develop test scaffolding concurrently with or ahead of the code, start small, and build the code base incrementally with continuous, automated test coverage. That way any bugs can often be limited to the last code increment, and one avoids having to track down a bug “de novo” in a code base of several thousand lines or more. Even if the worst case occurs, it is possible to find the bug(s) with a systematic approach using classical debugging techniques; it may just be pretty painful. Been there, done that, got the t-shirt :-)

It’s just that it’s frustrating when the code fails so miserably on the GPU and then you almost copy and paste it into CPU code and it all works just fine. After some extensive runs with the CPU program I discovered that it leaked memory. Yes, I’m new to C/C++ programming and particularly to CUDA programming. So porting it to the CPU wasn’t such a bad idea after all; in fact I get twice the performance on the CPU compared to the GPU, when running on 6 parallel threads that is.

Here’s what went down: say I have declared two arrays with ‘new int’, call them Array1 and Array2. Then the statement:

delete[] Array1, Array2;

will, for some funny reason, only deallocate Array1 but not Array2. I mean, the declaration “int var1, var2;” declares not only var1 but both var1 and var2. Changing those delete lines fixed the memory leaks and, to boot, made the CUDA version work.
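In code form, the pitfall and the fix look roughly like this (array sizes are made up for illustration):

#include <cstddef>

int main()
{
    int *Array1 = new int[100];
    int *Array2 = new int[100];

    // Buggy form from the post: the comma here is the comma operator, so only
    // 'delete[] Array1' is executed; 'Array2' is merely evaluated and
    // discarded, and its memory leaks.
    //     delete[] Array1, Array2;

    // Correct: one delete[] per new[]
    delete[] Array1;
    delete[] Array2;
    return 0;
}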

But still, why is cuda-memcheck freezing the entire system indefinitely?

Btw, where can I get those rubber chickens, gotta try that some day!

May be of interest, regarding “some funny reason”:

http://stackoverflow.com/questions/6694745/how-to-delete-multiple-dynamically-allocated-arrays-in-a-single-delete-statement

What is the bug number for the bug you filed about this?

I would never have been able to handle the issues in my code if I had stuck to tracking them down with printf() statements. So the CPU porting approach did pay off.

The bug ID is: 1715171

Now I wonder whether those rubber chickens come with a pulley in the middle…

I don’t own a rubber chicken and have never used one, as I don’t believe in voodoo magic. However, if you search for “rubber chicken” using your favorite internet search engine, you will instantly find offers from multiple suppliers. Apparently rubber chickens are used as comedy props and toys for dogs, besides their use as magical tools for solving computer problems :-)