Running CUDA code on many devices, driver crashes

Hi,

I’ve written some CUDA code that runs concurrently on a single machine (Windows 10, 64-bit) which has several graphics cards installed.

It’s working well and I am able to run a single instance of my code per graphics card. However, after a few hours one of the instances crashes and brings down all the other instances of the application too.

(It looks like the graphics driver crashes: the screen goes black for a bit and comes back after around 30 seconds.)

Has anybody experienced this behaviour? Any tips for debugging?

The test machine is not my development machine, but I am considering installing the compiler and running it from there. I’ve read articles about turning on Windows minidumps for this sort of thing, but I’m not convinced that will lead to an effective diagnosis.

I’m wondering if I’m experiencing some sort of buffer overflow, writing to bits of GPU memory I have no right to be changing; i.e. the bit of code causing the problem may not appear in the minidump, just the end result of the glitch.
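For what it’s worth, the kind of per-launch error checking I have in mind looks roughly like this (just a sketch; the macro and kernel names are my own invention), so at least the failing launch gets reported before everything comes down:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Rough sketch of a launch-checking macro (name is just illustrative).
// Checks both the launch itself and the kernel's completion, so a failure
// that kills the context shows up at the right spot.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against writing past the end of the buffer
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    CHECK_CUDA(cudaMalloc(&d_data, n * sizeof(float)));

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    CHECK_CUDA(cudaGetLastError());        // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());   // errors during kernel execution

    CHECK_CUDA(cudaFree(d_data));
    return 0;
}
```

Running the binary under cuda-memcheck should also flag an out-of-bounds write itself, rather than just its after-effects.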

Thanks in advance.

Regards, Phill.

Hmmm. I’ve decided to jerry-rig a fast console output window; when it crashes it at least shows me the last message before the crash.
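(Something along these lines, just a rough sketch: write straight to stderr and flush immediately, so the last line actually makes it out before the driver goes down.)

```cpp
#include <cstdarg>
#include <cstdio>

// Rough sketch of the "fast console output": unbuffered logging so the last
// message survives even if the process dies right afterwards.
void logMsg(const char* fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);
    fflush(stderr);   // make sure it hits the console before a crash
}
```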

I’m also commenting out code as a process of elimination, which is a bit of a bore.

Random crashes after a few hours of running on a production machine are the worst.

Are all of the GPUs in the system of the same type? Is it possible the execution time of any of the CUDA kernels in the software is close to the operating system’s watchdog timer limit (typically around 2 seconds) on the slowest GPU in the system? The profiler will help pinpoint long-running kernels.

If the watchdog timer limit is exceeded, the operating system will force a graphics driver reset, and I would think this destroys the CUDA contexts for all the GPUs in the system. Recovery time after a watchdog timer reset varies; I have seen anywhere from 2 seconds to a minute (30 seconds falls squarely into that range).
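If the profiler is awkward to run on the production machine, a quick sanity check is to bracket suspect launches with CUDA events and print anything that gets near the ~2 second limit. A rough sketch (kernel name and threshold are made up for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Rough sketch: time a kernel launch with CUDA events and warn if it gets
// anywhere near the ~2 second watchdog limit.
__global__ void someKernel() { /* ... */ }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    someKernel<<<1024, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms > 1000.0f)   // flag anything over 1 s as uncomfortably close
        fprintf(stderr, "kernel took %.1f ms, close to the watchdog limit\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```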

Thanks for the reply!

All GPUs are the same. However, I did pluck up the courage to enable minidumps on the operating system and tracked it down to some dodgy code running on the CPU, although that doesn’t obviously explain the graphics driver resetting on occasion. But ever since I fixed that issue I haven’t had it crashing anymore.

The watchdog timeout is an interesting thought. I saw it when I first enabled GPU debugging with breakpoints, but I don’t think any of my kernels have that long an execution time. I did notice some odd behaviour when I put logging debug messages into the system: the slower it went, the more frequent the error would be, so rather than waiting several hours I could make it crash fairly quickly. But again, this was tracked down to sloppy CPU code.

Maybe there’s still an underlying bug lurking there…

Minidump instructions: