Diagnosing CUDA causing a hard system crash

(I’m not sure if this is a programming issue or a setup issue, since I haven’t been able to narrow it down at all.)

My application is crashing my main development system. It happens every time: when processing starts, mouse clicks and the keyboard stop responding for about ten seconds (the mouse cursor still moves), then the system hard freezes (the mouse cursor stops moving). This started after I came back from a couple of weeks' hiatus on this project, so I don’t know what changed to cause it.

I’ve tried narrowing down the cause, which became very confusing:

  • I reinstalled Windows 7 from scratch. The issue still happened.
  • I put my GPU into another Windows 7 PC. The issue didn’t happen. (This suggests an issue other than the GPU.)
  • I put my newly installed Windows 7 drive into the other PC. The crash happened there too.

This doesn’t make any sense. It seems like a software issue, since the crash follows the Windows installation around, but it’s a brand new installation.

I’m at a loss here. I’m on a new PC now, and I may need to do yet another complete reinstallation, but that’s grasping at straws and could be a big waste of time. The system freezes rather than bugchecking, so there are no crash logs or minidumps to look at. The application works perfectly on the other PC with its original Windows installation, so I don’t think the problem is in my application.

The Windows watchdog timer is disabled. That does cause a hard crash if it triggers, but I know what that crash looks like (two of my monitors flashing black and green to give me a seizure, speakers blaring in a DMA loop, followed by a hard crash–that’s not this crash).

If it helps: When I tried stepping through my code until I hit the crash, it happened when I wrote to managed memory. However, it didn’t always happen in the same place. I didn’t try this too many times, since single stepping a crash that freezes the whole system is incredibly painful and didn’t look like it was helping.

Any suggestions to track this down? I have to switch systems to do any CUDA development, which is bringing my project to a halt.

Is it only this particular application that causes the lockup, or does any CUDA app (e.g. CUDA samples?) cause a lockup? If you think it’s restricted to managed memory, does a managed memory CUDA sample app also cause the problem?

Which version of CUDA are you using, and which GPU driver, and which GPU?

A clean Windows install does not disable the watchdog timer. So obviously you’ve done some things to it other than the minimum necessary:

-load OS
-load Visual Studio
-load CUDA

What things besides the above do you do to the installation?

If you can develop a short, simple application that reproduces the issue, you can at least have others look at it to give you another opinion on its validity. I understand that you can run it on another system and it doesn’t crash, but that isn’t a 100% guarantee of valid behavior.

An invalid application should never cause a hard, undebuggable system crash. Isolating it will be difficult, due to the nature of the issue (a hard reboot for each test), but I’ll give it a try, and run through the sample applications to see if any of them trigger it. I saw the crash when I single stepped over a write to managed memory, but other times it stepped past that point. Nobody else who’s used the application has reported any system crashes.

It happens in CUDA 6.5 and 7. GeForce 750 Ti, currently 347.88 drivers. Of course I installed other typical things: hardware drivers, Maya (the CUDA application is a Maya plugin), VS2013, etc. I wasn’t expecting the issue to keep happening through a system reinstall (or if it did, I expected it to isolate to hardware), so I was proceeding with a normal desktop installation. I’m hoping someone might recognize a known issue before I spend a bunch of time doing another reinstall. (Also, if a user has this issue, I’d have no idea what to tell them.)

From your description, it doesn’t sound like the system crashed. It sounds more like the GUI is frozen due to a long-running CUDA kernel. Preventing that is the task of the operating system’s watchdog timer. It is not clear whether the watchdog timer is presently enabled (the default) or disabled on your new system.
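To make the version, GPU, and watchdog questions concrete, here is a minimal sketch that prints what the CUDA runtime itself reports. It uses only standard runtime API calls. As far as I know, `kernelExecTimeoutEnabled` indicates whether a run-time limit applies to kernels on that device; whether it reflects a watchdog disabled through the Windows registry is an assumption I have not verified. Note that `cudaDriverGetVersion` reports the CUDA version the driver supports, not the display driver version (such as 347.88).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Versions are encoded as 1000 * major + 10 * minor (e.g. 7000 for CUDA 7.0).
    int runtimeVersion = 0;
    int driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);
    printf("CUDA runtime %d.%d, driver supports CUDA %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10,
           driverVersion / 1000, (driverVersion % 1000) / 10);

    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // kernelExecTimeoutEnabled is nonzero when kernels on this device
        // are subject to a run-time limit.
        printf("GPU %d: %s (compute %d.%d), kernel run-time limit: %s\n",
               dev, prop.name, prop.major, prop.minor,
               prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

Running this on both the failing and the working installation would show whether the two setups differ in anything the runtime can see.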

An alternative scenario I have seen is that there is a long-running kernel which is killed by the watchdog timer after several seconds, but due to the lack of a proper status check in the application, that same kernel is almost immediately called again. This gives every appearance of the GUI having gone dead for long periods of time.
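A minimal sketch of the kind of status check meant here, assuming a loop that relaunches a kernel. The kernel `work_kernel` and the helper `check` are hypothetical names for illustration, not taken from the application under discussion. The point is that the loop stops as soon as a launch or its execution reports an error (a watchdog kill would typically surface as `cudaErrorLaunchTimeout`), instead of launching the same kernel again.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the application's real kernel.
__global__ void work_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Hypothetical helper: reports the error and returns false on failure.
static bool check(cudaError_t err, const char *what)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    if (!check(cudaMalloc(&data, n * sizeof(float)), "cudaMalloc"))
        return 1;
    if (!check(cudaMemset(data, 0, n * sizeof(float)), "cudaMemset"))
        return 1;

    for (int pass = 0; pass < 10; ++pass) {
        work_kernel<<<(n + 255) / 256, 256>>>(data, n);

        // Catches errors in the launch itself (bad configuration, etc.).
        if (!check(cudaGetLastError(), "kernel launch"))
            break;

        // Catches errors during execution, e.g. a watchdog timeout.
        // Breaking here avoids immediately relaunching a kernel that was killed.
        if (!check(cudaDeviceSynchronize(), "kernel execution"))
            break;
    }

    cudaFree(data);
    return 0;
}
```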

There is a possibility that there could be a driver issue with recovery after a watchdog timer kernel termination. I have seen this a few times on Windows with preliminary drivers. A single watchdog timer event would lead to a recovery time measured in several minutes, during which time the GUI was completely unresponsive. I don’t recall encountering such issues with released WHQL drivers.

I think the best course of action is to eliminate excessive kernel run time. At minimum this will make it much easier to figure out what is going on. When the GUI freezes during debugging it is not conducive to forward progress.
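One common way to cut kernel run time is to split the work across several short launches instead of one long one. The sketch below assumes the work is an independent per-element operation. The kernel `scale_chunk` and the chunk size are hypothetical and would need tuning for a real workload. Synchronizing after each chunk is also an assumption on my part: it keeps each submission to the GPU short, at some cost in throughput.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: processes `count` elements starting at `offset`.
__global__ void scale_chunk(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;
}

int main()
{
    const int n = 1 << 24;      // total number of elements
    const int chunk = 1 << 20;  // elements per launch (assumed value, tune as needed)
    const int threads = 256;

    float *data = nullptr;
    cudaError_t err = cudaMalloc(&data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(data, 0, n * sizeof(float));

    for (int offset = 0; offset < n; offset += chunk) {
        // The last chunk may be shorter than the others.
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        int blocks = (count + threads - 1) / threads;

        scale_chunk<<<blocks, threads>>>(data, offset, count);

        // Finish this chunk before submitting the next, and stop on any error.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "chunk at offset %d failed: %s\n",
                    offset, cudaGetErrorString(err));
            break;
        }
    }

    cudaFree(data);
    return err == cudaSuccess ? 0 : 1;
}
```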

It happened when single-stepping in the debugger over simply setting a managed pointer. You can’t even modify managed memory when kernels are running, and cudaThreadSynchronize() is called immediately before this write to ensure that’s the case.

(This isn’t the only place it happened–it doesn’t seem to be deterministic, unfortunately, but it happened here at least once or twice at this point when I was explicitly stepping through to see where it happened.)

I’ll try to binary search the trigger a bit today.

I haven’t yet been able to repro it outside of Maya (possibly because the standalone test exits immediately, and the crash doesn’t happen for a couple seconds), but inside a trivial plugin, this is all it takes:

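#include <cuda_runtime.h>  // runtime API: cudaDeviceSynchronize()

// Managed (unified memory) pointer, accessible from both host and device.
__device__ __managed__ const void *managed_ptr;

// Empty kernel; the repro needs no device work.
__global__ void kernel()
{
}

void test()
{
    cudaDeviceSynchronize();
    managed_ptr = NULL;  // host write to managed memory while no kernel is running
    cudaDeviceSynchronize();

    kernel<<<16, 1>>>();
    cudaDeviceSynchronize();
}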

It probably doesn’t matter, but here’s the plugin stub that’s calling it:

#include <maya/MPxCommand.h>
#include <maya/MFnPlugin.h>
using namespace std;

void test();

class TestCommand : public MPxCommand
{
public:
    MStatus doIt(const MArgList &arg)
    {
        test();
        return MS::kSuccess;
    }

    bool isUndoable() const { return false; }
    static void *creator() { return new TestCommand(); }
};

MStatus initializePlugin(MObject obj)
{
    MFnPlugin plugin(obj, "", "1.0", "Any");
    return plugin.registerCommand("testCommand", TestCommand::creator);
}

MStatus uninitializePlugin(MObject obj)
{
    MFnPlugin plugin(obj);
    return plugin.deregisterCommand("testCommand");
}

I tried removing the use of managed memory (in the full application), and the problem still happens. I don’t think this is an issue being caused by my code. I guess the next step is to try replacing the GPU…

The same issue happens with a different card (I swapped the EVGA 750 Ti for a Gigabyte 750). This seems like a driver issue.

It also happens if I run on one card while the other card is being used for video. The symptoms are different, but it looks like the same issue. (Instead of hard crashing immediately, that GPU seems to get wedged, the application freezes and can’t be killed, and the system froze a couple minutes later when I tried to use CUDA again.) This one did give a minidump, so at least there’s something for a bug report.

I’m not sure what else I can try beyond reinstalling Windows again.