Reset driver without rebooting on linux

LongY · December 12, 2015, 6:08am

I know this topic is not a new one. After searching online for the topic, I didn’t find a good answer.

It is inconvenient to reboot every time the driver crashes. Tesla GPUs have nvidia-smi --gpu-reset, which is not supported on the GTX 450.

My system configuration:
OS: Ubuntu 14.04.1
GPU: GTX 450
Nvidia Driver Version: 352.63

One thread suggested reloading the driver, but I am not familiar with the Linux kernel and am afraid to screw up the only GPU I have. Can anyone give me a pointer on how to reset the driver without rebooting on a GTX 450?

Thank you.

Robert_Crovella · December 12, 2015, 2:13pm

If you are using the GPU for display this won’t work/you can’t reset the driver (while X is active on that GPU).

If you are not using the GPU for display you can do:

sudo rmmod nvidia

The next operation you do with the GPU will force a driver reload, but you can manually do it with e.g.:

sudo nvidia-smi

As stated elsewhere, the CUDA runtime should do a pretty good job of cleaning up without any of this as long as you kill any host processes associated with the crash session.
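
Before attempting rmmod, it can help to see what still holds the driver. The sketch below is an illustration, not part of the original reply: module_users and gpu_processes are hypothetical helper names, and the script assumes the standard /proc/modules layout, where the fourth field lists the modules that depend on the named one.

```shell
#!/bin/sh
# Hypothetical helper: print the modules that depend on module $1.
# Reads /proc/modules by default; $2 can name another file in the same format.
# A result of "-" means no other module depends on it.
module_users() {
    awk -v m="$1" '$1 == m { sub(/,$/, "", $4); print $4 }' "${2:-/proc/modules}"
}

# Hypothetical helper: list host processes that still have the GPU device
# nodes open. fuser comes from the psmisc package. Defined but not called
# here, because it needs root and a loaded driver.
gpu_processes() {
    sudo fuser -v /dev/nvidia*
}

# Kernel modules that must be removed before "nvidia" can be unloaded.
module_users nvidia
```

Calling gpu_processes afterwards shows which host processes would have to be killed before the module can be removed cleanly.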

LongY · December 14, 2015, 2:38am

Thank you for your comment, txbob. I appreciate it.

The primary reason I want to reset the driver is that my application runs for more than 2 seconds and the watchdog timer is triggered; consequently, the driver crashes. Rebooting the server is very inconvenient. I was thinking that resetting the driver without rebooting would save a lot of hassle.

I tested the commands and got the following output.

LongY@Ubuntu:~/Desktop$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm
LongY@Ubuntu:~/Desktop$ sudo rmmod -f nvidia_uvm
LongY@Ubuntu:~/Desktop$ sudo rmmod nvidia
LongY@Ubuntu:~/Desktop$ sudo nvidia-smi
No devices were found

I also tried the command below:

LongY@Ubuntu:~/Desktop$ sudo rmmod -f nvidia
[sudo] password for LongY: 
rmmod: ERROR: ../libkmod/libkmod-module.c:769 kmod_module_remove_module() could not remove 'nvidia': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia: Resource temporarily unavailable
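
The first transcript shows that nvidia_uvm holds a reference on nvidia, so it has to be unloaded first. A minimal sketch of that order, without the -f flag (which the rmmod manual page describes as extremely dangerous), might look like the following. The run and reset_nvidia_driver names and the DRY_RUN variable are made up for this illustration; with DRY_RUN left at its default of 1 the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: unload the dependent module first, then the core module,
# then trigger a reload. Assumes no X server or CUDA process is using the GPU.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_nvidia_driver() {
    # nvidia_uvm depends on nvidia, so it has to go first.
    run sudo rmmod nvidia_uvm || return 1
    run sudo rmmod nvidia || return 1
    # Any GPU operation reloads the driver; nvidia-smi is a convenient one.
    run sudo nvidia-smi
}

reset_nvidia_driver
```

Running it with DRY_RUN=0 performs the steps for real and stops at the first rmmod that fails, which is the point at which a process or another module still holds the driver.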

This post also provides some information.
http://askubuntu.com/questions/611146/how-to-restart-nvidia-cuda-driver-without-rebooting

The CUDA driver goes back to normal after rebooting. I also updated the driver to version 352.63.

njuffa · December 14, 2015, 3:18am

It sounds like you would want to rework your app to avoid kernels that get close to the timeout limit.

Your description of the driver “crashing” when a watchdog event is triggered does not sound right to me. It used to be the case, on both Linux and Windows, that in such a situation the current CUDA context is destroyed, but the CUDA driver itself recovered. This recovery could take up to several seconds. After recovery, other CUDA apps could be run.

I ran on an RHEL-based workstation and driver recovery seemed to work quite well, although it did happen on a few occasions that after multiple consecutive timeout events, unloading and reloading of the driver as described by txbob became necessary. This required stopping X and dropping into console mode, but it did not require rebooting the machine as a whole.

So what exactly are the symptoms observed when the “driver crashes”? Is there a possibility that this is some sort of Ubuntu-specific issue? I may be biased, but I have seen too many reports of “crazy stuff” happening on Ubuntu over the years that I never observed on RHEL that I have decided to stay as far away from Ubuntu as possible.

LongY · December 14, 2015, 3:59am

Thank you, njuffa.

This thread is actually a follow-up to the thread you commented on:
cudaMemcpy takes more than 2 seconds, then driver crashed. - CUDA Programming and Performance - NVIDIA Developer Forums

The symptoms are as follows (I break them down step by step to give you a better view):

My application runs, e.g., ./a.out.
It takes 2.5 seconds. After searching the Internet, I concluded that the CUDA driver had crashed. I might be wrong.
At this point, if I try to run ./a.out or any other CUDA program, the server crashes, meaning that I cannot do anything on the server except power-cycle it (turn the power switch off and on) to bring it back to normal. Issuing other CPU commands is fine.

Thus, I avoid running any CUDA program after my application has run for over 2 seconds. If I want to use the GPU again, I have to issue “sudo reboot” to reboot the server and make the GPU accessible again.

My server is not connected to a display and is only used for scientific computing.

njuffa · December 14, 2015, 4:22am

My recommendation in that thread still stands: If you seek assistance with debugging, you would want to post minimal, buildable, and runnable code that others can examine and run.

Note that an application running for 2.5 seconds does not necessarily mean that any CUDA kernel invoked by that application hit the operating system's time-out limit and triggered the watchdog.
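
One way to check whether the driver actually reported a GPU error, rather than inferring it from the run time, is to look in the kernel log. The NVIDIA kernel module logs GPU errors as lines containing "NVRM: Xid". The sketch below is only an illustration: xid_events is a hypothetical helper name, and whether a watchdog time-out on this particular driver version produces such a line is an assumption to verify.

```shell
#!/bin/sh
# Hypothetical helper: keep only NVIDIA Xid error lines from a kernel log
# read on standard input. "|| true" avoids a failure status when nothing matches.
xid_events() {
    grep 'NVRM: Xid' || true
}

# Typical use right after the failing run (dmesg may need root on some systems).
dmesg 2>/dev/null | xid_events
```

An empty result right after the 2.5-second run would suggest the driver did not log an error for it, in which case the cause is probably elsewhere.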

If a CUDA program causes a server to become completely unresponsive, including access via ssh, and requires power-cycling, I would think there are much bigger issues with that machine than anything related to CUDA per se.

Again, without access to code that actually triggers these issues, this is just a guess.
