Hitting card memory limit crashes code and hangs card

Hi.

I am having a problem with GPUs being left in a hung state. The code I am running uses multiple GPUs within an OpenMP parallel loop. It works fine with a single thread; however, when running on multiple cards I sometimes get a crash due to memory problems. That would be fine, except that one or more cards are often left in a hung state afterwards. Rebooting the machine clears it, but the machine is part of a cluster, so that is not a very satisfactory solution. Is there some way of resetting a graphics card from the command line if it ends up in a jammed state?
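To give an idea of the structure, it is roughly along these lines (a simplified sketch in CUDA C with made-up names and sizes; the real code is more involved and may not be in C at all):

```c
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

/* Hypothetical sketch: one GPU per OpenMP thread.  If a thread asks for
 * more memory than its card has, the allocation fails, and if that
 * failure goes unhandled the whole run can crash. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    #pragma omp parallel num_threads(ndev)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);                        /* one card per thread */

        size_t bytes = (size_t)512 * 1024 * 1024;  /* size depends on the problem */
        void *d_buf = NULL;
        cudaMalloc(&d_buf, bytes);   /* fails when the card's memory limit is hit */

        /* ... copy data over, launch kernels, copy results back ... */

        cudaFree(d_buf);
    }
    return 0;
}
```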

Rob.

Hi Rob,

If you’re in Linux, you can try running (as root):

modprobe -vr nvidia

to unload the driver, followed by:

modprobe -v nvidia

to start it again.

You will need to kill off any running processes that are using the GPU beforehand, and in some cases, if something still has the GPU tied up, you won't be able to unload the driver at all. At that point the only solution I know of is to restart the system.
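On the application side, one thing that may reduce how often a card is left tied up (just a sketch, and not guaranteed to catch every hang) is to check the allocation and reset the device explicitly when it fails, so the context on that card is torn down rather than left dangling:

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Sketch of a hypothetical helper: if the allocation fails (e.g. the
 * card's memory limit has been hit), report it and call
 * cudaDeviceReset(), which destroys this process's context on the
 * current device and releases its resources, instead of letting the
 * process die with the device still held. */
int try_alloc(void **d_buf, size_t bytes)
{
    cudaError_t err = cudaMalloc(d_buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                bytes, cudaGetErrorString(err));
        cudaDeviceReset();   /* tear down the context on this card */
        return -1;
    }
    return 0;
}
```

Even with something like that in place, a hard fault on the device can still wedge the card, so the modprobe route above is still worth having.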

Great - thanks for this. I have managed to persuade our system administrator to give me root access for now. It's a bit annoying that the cards hang so often, though - is this common?

Rob.

While I don’t know how common it is for others, my device hangs periodically as well. Like you, I find it happens more often when I’m running code that hits a device limit or encounters some other error.

  • Mat

Hi Mat. It's a bit annoying, but now that I have root access I can reboot as necessary. My next thread shows why I keep hitting problems…

Thanks,

Rob.