Computation crash = stuck at 574mhz

blade613x · August 2, 2015, 1:37pm

Sometimes, when my computations crash during debugging and a Thrust memory exception occurs, my GPU becomes stuck at 574 MHz. Is there any way to get it “unstuck” without rebooting or forcing the driver to crash? I typically run computations on multiple GPUs at once.

These crashes can occur anywhere from 20 minutes to 12 hours into a computation, or never (oh the joy of debugging!). I’d therefore like maximum performance at all times while I work out what is causing every variable to blow up to infinity, which crashes my computation, makes Thrust throw an exception, and leaves my GPU stuck at 574 MHz.
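One way to narrow down where the values first blow up is to scan the state for non-finite entries after every step and stop at the first hit, rather than waiting for the eventual crash. A minimal host-side C++ sketch, assuming the state has already been copied back from the device into an ordinary array of doubles; `first_non_finite` is a hypothetical helper name, not part of Thrust or CUDA:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns the index of the first NaN or +/-Inf entry
// in data[0..count), or -1 if every value is finite.
std::ptrdiff_t first_non_finite(const double* data, std::size_t count) {
    assert(data != nullptr || count == 0);  // an empty range may pass nullptr
    for (std::size_t i = 0; i < count; ++i) {
        if (!std::isfinite(data[i])) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Calling this once per step and logging the step number together with the returned index gives the earliest point of divergence, which is usually far more informative than the state at the time of the crash. The copy back to the host costs time, so it probably belongs in a debug build only.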

little_jimmy · August 2, 2015, 1:59pm

“or forcing the driver to crash?”

good heavens, how do you manage this?

“Is there any way to get it “unstuck” without rebooting or forcing the driver to crash? I typically run computations on multiple GPUs at once.”

if you find one, kindly let me know

i suppose that would be a grand RFE: an API that can reset the device, close to a “shutdown and restart”
i also think this is long overdue

i do not want to burst your bubble, but i doubt whether you are going to find such an “unstucker”
personally, i would therefore focus on mechanisms and methods to identify the cause as fully, and as quickly, as possible
you may have to build a debug version - a version with extra redundancy, maintained predominantly for debugging
for example, one option may be to have the debug version periodically push a trace into memory, such that, if a crash occurs at 12 hours 1 sec, the program can recommence from the 12 hour mark in a flash
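The trace idea above can be sketched as a periodic checkpoint. This is only an illustration under stated assumptions: the state is taken to be a flat array of doubles copied back from the device, it is written to disk rather than held in memory so that it survives a process crash, and the names `Checkpoint`, `save_checkpoint` and `load_checkpoint` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot: the step counter plus the state copied from the device.
struct Checkpoint {
    std::uint64_t step = 0;
    std::vector<double> state;
};

// Write to a temporary file, then rename it over the old checkpoint, so a
// crash in the middle of a save does not destroy the previous good copy.
// (On POSIX systems rename replaces an existing file.)
bool save_checkpoint(const std::string& path, const Checkpoint& cp) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        const std::uint64_t count = cp.state.size();
        out.write(reinterpret_cast<const char*>(&cp.step), sizeof cp.step);
        out.write(reinterpret_cast<const char*>(&count), sizeof count);
        out.write(reinterpret_cast<const char*>(cp.state.data()),
                  static_cast<std::streamsize>(count * sizeof(double)));
        if (!out) return false;
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Read a checkpoint back; cp is left untouched if anything fails.
bool load_checkpoint(const std::string& path, Checkpoint& cp) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::uint64_t step = 0, count = 0;
    in.read(reinterpret_cast<char*>(&step), sizeof step);
    in.read(reinterpret_cast<char*>(&count), sizeof count);
    if (!in) return false;
    std::vector<double> state(count);
    in.read(reinterpret_cast<char*>(state.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    if (!in) return false;
    cp.step = step;
    cp.state = std::move(state);
    return true;
}
```

At startup the program would try `load_checkpoint` and fall back to step 0 if it fails. Saving every N steps trades some run time for being able to replay only the last few minutes before a crash instead of the full 12 hours.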

little_jimmy · August 2, 2015, 2:53pm

“I typically run computations on multiple GPUs at once.”

how do you distribute the work?

i have found that, when i distribute the work more aggressively, such that kernels/devices work on smaller sub-problems or ‘work sets’, and more frequently retire completed work and accept new work, i generally arrive at errors more quickly, and a lot earlier on in the program, should there be any errors left
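A rough host-side sketch of handing out small work sets, assuming the problem can be indexed as a range [0, total) and that each device is driven by its own host thread; `WorkDispenser` is a hypothetical name, and the per-device kernel launch is left out:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical dispenser: hands out [begin, end) slices of at most `chunk`
// items. The atomic counter lets several host threads (for example one per
// GPU) pull work without a lock.
class WorkDispenser {
public:
    WorkDispenser(std::size_t total, std::size_t chunk)
        : next_(0), total_(total), chunk_(chunk) {
        assert(chunk > 0);
    }

    // Returns false once all work has been handed out.
    bool next(std::size_t& begin, std::size_t& end) {
        const std::size_t start = next_.fetch_add(chunk_);
        if (start >= total_) return false;
        begin = start;
        end = std::min(start + chunk_, total_);
        return true;
    }

private:
    std::atomic<std::size_t> next_;
    std::size_t total_;
    std::size_t chunk_;
};
```

Each device thread would loop on `next`, process the slice, check the result, and only then ask for more. With a small `chunk`, a bad result is tied to a narrow slice of the input shortly after it is produced.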

Clochette · August 3, 2015, 11:38pm

If you’re using Windows, go to the Device Manager, disable the GPU and then enable it.

Robert_Crovella · August 4, 2015, 3:30am

nvidia-smi -r is intended to reset a GPU, although it requires root privilege and cannot be used (AFAIK) to reset the “primary” GPU (which I think means a GPU driving a display).

Also, in Linux, if you do not have X loaded on the GPU in question, you can do something like:

rmmod nvidia

After that, the next CUDA activity should force a driver reload, which should reset the GPU. (This method also requires root privilege.)

little_jimmy · August 4, 2015, 4:45am

so, you cannot revive a crashed device from within your application…?
the closest thing to a solution is running a script from within your application?

this seems to champion a ‘make sure it never breaks; immediately fix it when it does’ approach

Robert_Crovella · August 4, 2015, 5:08am

From within the application:

cudaDeviceReset()

little_jimmy · August 4, 2015, 6:35am

i have to (re)check that - if i am not mistaken, i have been told that cudaDeviceReset only resets the context; cudaDeviceReset is insufficient to revive a device with ‘mad-card-disease’

but apart from that, i would think that the reset is only half of the problem/solution
what about the monitoring?

do (compute) cards (running complex code) more often than not, or hardly ever, enter a state of perpetual insanity?
is the case of blade613x really such an uncommon case?
what about the context of servers/ clusters?
what if blade613x did not pick this up, and it occurred in the field?

Robert_Crovella · August 4, 2015, 3:01pm

Since I don’t know what mad-card-disease is (or perpetual insanity), I don’t know what is sufficient to revive it.

I’d be willing to bet that none of these methods works in every case. It may be that a reboot is necessary. Isn’t this true of PCs in general?

There are failures that occur in server clusters, with or without GPUs. Checkpoint/restart techniques long predate GPUs. And sure, checkpointing is used in some cases for orderly shutdowns, but in many cases it is used to recover from an unexplained or unexpected crash.

I was just trying to offer some suggestions of things to try.

little_jimmy · August 4, 2015, 3:37pm

“Since I don’t know what mad-card-disease is (or perpetual insanity)”

simply what blade613x, and surely others, have experienced: should your device hit a bug that you are not aware of, it may crash and become unstable to the point that nothing short of a reset will revive it (and you may not even know that a device is ‘down’)

“I was just trying to offer some suggestions of things to try.”

a) and it was taken in no other way
b) no offense intended
c) your input is appreciated

my view is simply that i perceive the matter to be sufficiently common to warrant further investigation (by nvidia)
