Obviously you would want to limit the use of printf() to maybe just one or at most a few threads, e.g. just have thread index 0 print. It is also possible to increase the size of the buffer used by device-side printf().
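To illustrate, here is a minimal sketch of both ideas; the kernel name, arguments, and the 16 MB buffer size are placeholders, not anything from your code. Gating on `blockIdx.x == 0 && threadIdx.x == 0` restricts output to a single thread, and `cudaDeviceSetLimit()` with `cudaLimitPrintfFifoSize` enlarges the FIFO backing device-side `printf()` (the default is on the order of 1 MB):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(const float *data, int n)
{
    // Only one thread in the entire grid prints, to keep output manageable
    if ((blockIdx.x == 0) && (threadIdx.x == 0)) {
        printf("n=%d  data[0]=%f\n", n, data[0]);
    }
    // ... rest of kernel ...
}

int main(void)
{
    // Must be called before the kernel launch that uses printf()
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    // ... allocate, copy, launch kernel, etc. ...
    return 0;
}
```

Note that output from device-side `printf()` is flushed at synchronization points, so a `cudaDeviceSynchronize()` after the launch helps ensure you actually see it.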
In addition, it is probably a good idea to reduce the problem size, and thus presumably the grid configuration. Likewise, reduce the iteration count. You might also be able to selectively disable pieces of code with #ifdef and narrow down the location of the problem by bisecting the code in this manner. Or change the random numbers used in your code to well-known fixed numbers during debugging.
I would focus on the out-of-bounds access for now, as you already know that this problem exists in the code. Out-of-bounds accesses that are not of the off-by-1 type are often the result of operating on huge array indices, caused by mixing signed and unsigned data in index computations, for example.
Not sure how you got to this state of affairs. It may boil down to bugs in your code, bugs in the compiler, bugs in cuda-memcheck, or a combination thereof. Given that the CUDA development tools are quite robust these days, my initial hypothesis would be that the source of the trouble is bugs in the code itself.
My recommended software development strategy (not just for CUDA, but in general) is to develop test scaffolding concurrently with, or ahead of, the code, start small, and build the code base incrementally with continuous, automated test coverage. That way any bugs can often be limited to the last code increment, and one avoids the issue of having to track down a bug “de novo” in a code base of several thousand lines or more. Even if the worst case occurs, it is possible to find the bug(s) with a systematic approach using classical debugging techniques, it may just be pretty painful. Been there, done that, got the t-shirt :-)