Running CUDA code on many devices, driver crashes

Hi,

I’ve written some CUDA code that runs concurrently on a single machine (Windows 10, 64-bit) which has several graphics cards installed.

It’s working well and I am able to run a single instance of my code per graphics card. However, after a few hours one of the instances crashes and brings down all the other instances of the application too.

(It looks like the graphics driver crashes: the screen goes black for a bit and comes back after around 30 seconds.)

Has anybody experienced this behaviour? Any tips for debugging?

The test machine is not my development machine, but I am considering installing the compiler and running it from there. I’ve read articles about turning on Windows minidumps for this sort of thing, but I’m not convinced that will lead to an effective diagnosis.

I’m wondering if I’m experiencing some sort of buffer overflow, writing to bits of GPU memory I have no right to be changing; i.e. the bit of code causing the problem may not appear in the minidump, just the end result of the glitch.
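For what it’s worth, the kind of per-launch error checking I have in mind looks roughly like this (just a sketch; the macro and kernel names are my own invention), so at least the failing launch gets reported before everything comes down:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Rough sketch of a launch-checking macro (name is just illustrative).
// Checks both the launch itself and the kernel's completion, so a failure
// that kills the context shows up at the right spot.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against writing past the end of the buffer
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    CHECK_CUDA(cudaMalloc(&d_data, n * sizeof(float)));

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    CHECK_CUDA(cudaGetLastError());        // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());   // errors during kernel execution

    CHECK_CUDA(cudaFree(d_data));
    return 0;
}
```

Running the binary under cuda-memcheck should also flag an out-of-bounds write itself, rather than just its after-effects.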

Thanks in advance.

Regards, Phill.

Hmmm. I’ve decided to jerry-rig a fast console output window; when it crashes it at least shows me the last message before the crash.
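(Something along these lines, just a rough sketch: write straight to stderr and flush immediately, so the last line actually makes it out before the driver goes down.)

```cpp
#include <cstdarg>
#include <cstdio>

// Rough sketch of the "fast console output": unbuffered logging so the last
// message survives even if the process dies right afterwards.
void logMsg(const char* fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);
    fflush(stderr);   // make sure it hits the console before a crash
}
```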

I’m also commenting out code as a process of elimination, which is a bit of a bore.

Random crashes after a few hours of running on a production machine are the worst.

Are all of the GPUs in the system of the same type? Is it possible the execution time of any of the CUDA kernels in the software is close to the operating system’s watchdog timer limit (typically around 2 seconds) on the slowest GPU in the system? The profiler will help pinpoint long-running kernels.

If the watchdog timer limit is exceeded, the operating system will force a graphics driver reset, and I would think this destroys the CUDA contexts for all the GPUs in the system. Recovery time after a watchdog timer reset varies; I have seen anywhere from 2 seconds to a minute (30 seconds falls squarely into that range).
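If the profiler is awkward to run on the production machine, a quick sanity check is to bracket suspect launches with CUDA events and print anything that gets near the ~2 second limit. A rough sketch (kernel name and threshold are made up for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Rough sketch: time a kernel launch with CUDA events and warn if it gets
// anywhere near the ~2 second watchdog limit.
__global__ void someKernel() { /* ... */ }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    someKernel<<<1024, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms > 1000.0f)   // flag anything over 1 s as uncomfortably close
        fprintf(stderr, "kernel took %.1f ms, close to the watchdog limit\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```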

Thanks for the reply!

All GPUs are the same. However, I did pluck up the courage to enable minidumps on the operating system and tracked it down to some dodgy code running on the CPU, although that doesn’t obviously explain the graphics driver resetting on occasion. But ever since I fixed that issue I haven’t had it crashing anymore.

The watchdog timeout is an interesting thought. I saw it when I first enabled GPU debugging with breakpoints, but I don’t think any of my kernels have that long an execution time. I did notice some odd behaviour when I put logging debug messages into the system: the slower it went, the more frequent the error would be, so rather than waiting several hours I could make it crash fairly quickly. But again, this was tracked down to sloppy CPU code.

Maybe there’s still an underlying bug lurking there…

Minidump instructions: