Reset driver without rebooting on Linux

I know this topic is not a new one, but after searching online I didn’t find a good answer.

It is inconvenient to reboot every time the driver crashes. Tesla GPUs have nvidia-smi --gpu-reset, which is not supported on the GTX 450.

My system configuration:
OS: Ubuntu 14.04.1
GPU: GTX 450
Nvidia Driver Version: 352.63

One thread suggested reloading the driver, but I am not familiar with the Linux kernel and am afraid to screw up the only GPU I have. Can anyone give me a pointer on how to reset the driver without rebooting on a GTX 450?

Thank you.

If you are using the GPU for display, this won’t work: you can’t reset the driver while X is active on that GPU.

If you are not using the GPU for display you can do:

sudo rmmod nvidia

The next operation you do with the GPU will force a driver reload, but you can manually do it with e.g.:

sudo nvidia-smi

As stated elsewhere, the CUDA runtime should do a pretty good job of cleaning up without any of this as long as you kill any host processes associated with the crash session.

Thank you for your comment, txbob. I appreciate it.

The primary reason I want to reset the driver is that my application runs for more than 2 seconds and the watchdog timer is triggered; consequently, the driver crashes. Rebooting the server is very inconvenient. I was thinking that resetting the driver without rebooting would save a lot of hassle.

I tested the commands, with the following output:

LongY@Ubuntu:~/Desktop$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm
LongY@Ubuntu:~/Desktop$ sudo rmmod -f nvidia_uvm
LongY@Ubuntu:~/Desktop$ sudo rmmod nvidia
LongY@Ubuntu:~/Desktop$ sudo nvidia-smi
No devices were found

I also tried the following command:

LongY@Ubuntu:~/Desktop$ sudo rmmod -f nvidia
[sudo] password for LongY: 
rmmod: ERROR: ../libkmod/libkmod-module.c:769 kmod_module_remove_module() could not remove 'nvidia': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia: Resource temporarily unavailable

This post also provides some information.

The CUDA driver is back to normal after rebooting. I also updated the driver to version 352.63.

It sounds like you would want to rework your app to avoid kernels that get close to the timeout limit.

Your description of the driver “crashing” when a watchdog event is triggered does not sound right to me. It used to be the case, on both Linux and Windows, that in such a situation the current CUDA context was destroyed, but the CUDA driver itself recovered. This recovery could take up to several seconds. After recovery, other CUDA apps could be run.

I ran on an RHEL-based workstation and driver recovery seemed to work quite well, although it did happen on a few occasions that after multiple consecutive timeout events, unloading and reloading of the driver as described by txbob became necessary. This required stopping X and dropping into console mode, but it did not require rebooting the machine as a whole.
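For reference, here is a minimal sketch of that unload/reload sequence as a shell function. It assumes nothing is using the GPU (no X server, no CUDA processes) and that only the nvidia and nvidia_uvm modules are involved, as with the 352.xx driver in this thread; newer drivers may load additional modules. The names run, reload_nvidia, DRY_RUN and MODULES_FILE are made up for this example. It deliberately avoids rmmod -f, which the rmmod man page itself describes as dangerous.

```shell
#!/bin/sh
# Sketch only: unload the NVIDIA kernel modules in dependency order,
# then let nvidia-smi trigger a driver reload.
# Assumption: no X server or CUDA process is holding the GPU.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_nvidia() {
    # MODULES_FILE is a hypothetical override so the logic can be tried
    # against a saved copy of /proc/modules.
    modules_file="${MODULES_FILE:-/proc/modules}"

    # nvidia_uvm depends on nvidia, so it has to be removed first.
    for mod in nvidia_uvm nvidia; do
        if grep -q "^${mod} " "$modules_file"; then
            run sudo rmmod "$mod" || return 1
        fi
    done

    # Running nvidia-smi as root forces the driver to load again.
    run sudo nvidia-smi
}
```

Running `DRY_RUN=1 reload_nvidia` first prints the commands without executing them. If a plain rmmod still reports the module as in use, some process is likely still holding the device, and it would be better to find and stop that process than to force the removal.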

So what exactly are the symptoms observed when the “driver crashes”? Is there a possibility that this is some sort of Ubuntu-specific issue? I may be biased, but I have seen too many reports of “crazy stuff” happening on Ubuntu over the years that I never observed on RHEL that I have decided to stay as far away from Ubuntu as possible.

Thank you, njuffa.

This thread actually follows on from the thread you commented on.

The symptoms are as follows (I break them down step by step to give you a better view):

  1. My application runs, e.g., ./a.out
  2. It takes 2.5 seconds. After searching the Internet, I concluded that the CUDA driver has crashed. I might be wrong.
  3. At this point, if I try to run ./a.out or any other CUDA program, the server crashes, meaning that
    I cannot do anything on the server except reboot it (turn the power switch off and on) to get it back to normal. Issuing other CPU commands is fine.

Thus, I avoid running any CUDA program after my application has run for over 2 seconds. If I want to use the GPU again, I have to issue “sudo reboot” to make the GPU accessible again.

My server isn’t connected to a display and is only used for scientific computing.

My recommendation in that thread still stands: If you seek assistance with debugging, you would want to post minimal, buildable, and runnable code that others can examine and run.

Note that an application running for 2.5 seconds does not necessarily mean that any CUDA kernel invoked by that application hit the operating system’s time-out limit and triggered the watchdog.
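One way to look for evidence either way is the kernel log: the NVIDIA kernel module prefixes its messages with "NVRM:", and GPU error events are normally reported there as "Xid" lines. The helper below is only a sketch, and the function names xid_events and xid_count are made up for this example. I would expect a watchdog time-out to leave such a line, but that is an assumption to verify on the machine in question.

```shell
#!/bin/sh
# Sketch only: filter kernel-log text for NVIDIA driver error reports.

# Reads kernel-log text on stdin and prints only the NVRM Xid lines.
xid_events() {
    grep 'NVRM: Xid' || true
}

# Reads kernel-log text on stdin and prints how many Xid lines it contains.
xid_count() {
    grep -c 'NVRM: Xid' || true
}
```

Typical use would be `dmesg | xid_events` right after the application finishes (dmesg may need sudo on some systems). If nothing shows up, the 2.5 second run probably did not trip the watchdog and the cause lies elsewhere.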

If a CUDA program causes a server to become completely unresponsive, including access via ssh, and requires power-cycling, I would think there are much bigger issues with that machine than anything related to CUDA per se.

Again, without access to code that actually triggers these issues, this is just a guess.