BUG: nvidia_uvm needs to be removed and re-inserted in order to work after wakeup from suspend

xuancong84 · September 4, 2021, 8:22am

Dear developer, it seems that the latest nvidia-470 (470.63.01) driver has a minor glitch, that after every wake-up from suspend (especially when there are active CUDA processes either running or at breakpoint), I need to manually rmmod and modprobe the nvidia-uvm kernel driver in order to run PyTorch/Tensorflow with CUDA support. The following terminal screenshot explains the situation:

xuancong@wxc-dell:~$ ./Desktop/anaconda-python3 -c “import torch; print(torch.cuda.is_available())”
/opt/anaconda3/lib/python3.8/site-packages/torch/cuda/init.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False

xuancong@wxc-dell:~$ lsmod | grep nvidia
nvidia_uvm 1032192 0
nvidia_drm 61440 2
nvidia_modeset 1196032 2 nvidia_drm
nvidia 35266560 73 nvidia_uvm,nvidia_modeset
drm_kms_helper 200704 2 nvidia_drm,i915
drm 495616 22 drm_kms_helper,nvidia,nvidia_drm,i915
xuancong@wxc-dell:~$ sudo rmmod nvidia_uvm
xuancong@wxc-dell:~$ sudo modprobe nvidia_uvm
xuancong@wxc-dell:~$ ./Desktop/anaconda-python3 -c “import torch; print(torch.cuda.is_available())”
True
xuancong@wxc-dell:~$ nvidia-smi
Sat Sep 4 16:19:40 2021
±----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------±---------------------±---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … Off | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 N/A / N/A | 4MiB / 4042MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+

±----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 21812 G /usr/lib/xorg/Xorg 4MiB |
±----------------------------------------------------------------------------+

thomas.hartwig · April 16, 2022, 7:52am

I can confirm this on a Ubuntu 20.04 and 510 also. I am pretty sure it is worth to file a bug report.

amrits · April 18, 2022, 11:50am

I used below configuration setup and ran bandwidthTest. I performed suspend/resume operation multiple times but could not observed any such requirement to re-insert nvidia_uvm.
Alienware Desktop + Ubuntu 20.04 LTS + Driver 470.63.01 + NVIDIA GeForce GTX 1080 Ti + Cuda 11.4

Please help to share reliable repro steps and nvidia bug report.

xuancong84 · April 19, 2022, 1:58am

Thanks @amrits for testing! The bug still exists on my Dell XPS laptop with Driver Version: 510.60.02 CUDA Version: 11.6.
There are a few things you need to test:

you need to test for all sleep modes: s2idle, deep, etc. (run cat /sys/power/mem_sleep to check and toggle)
you need to sleep while the GPU/CUDA processes are running
during sleep, try to plug/unplug power to trigger power events
it seems that the bug is related to sleep time, longer sleep time, higher chances of occurring, so sleep for at least 2 hours
Maybe the bug is only for Dell laptops, but you should test on a few different devices. One laptop definitely is not enough. Then you should be able to reproduce the bug.

Thanks!
Good luck!

aertonda · March 9, 2023, 8:44pm

this bug happens on multi-gpu machines with cuda 12 and v525 or 528. I solved it by sudo rmmod nvidia_uvm from the host. Looks like if 2 processes access this file at the same time, 1st one blocks it and 2nd one hangs and after that any call to cuda gives an error for everybody. I thought it was related to power management, but looks like it’s not. Because after restart if the nvidia_uvm file is there and 2 proc access it, I get immediately that error. On Ubuntu 20.04. So something is wrong with this. Hope that helps trace it further

xuancong84 · March 10, 2023, 12:27am

The bug still exists on the latest versions of Nvidia drivers up to today. Did you suspend while CUDA process is running or at breakpoint? If I terminate all CUDA processes before suspend, there is no such issue.

mprz1024 · October 22, 2023, 11:21pm

I can confirm this happens to me on driver 525.125 on AMD64 Debian Sid kernel 6.5.0-2 with a GTX-1070, exactly as described after suspend.

opisalwaysafag · October 23, 2023, 6:18am

nothing’s gonna happen if none of you provide the nvidia-bug-report

amrits · October 23, 2023, 10:29am

I have filed a bug 4343535 internally for tracking purpose.
Would request to share nvidia bug report from repro state so that I can match configuration as close as possible.

amrits · October 30, 2023, 11:22am

Hi,
Can someone please share nvidia bug report from repro state so that I can match configuration as close as possible.

hansg91 · November 18, 2023, 11:40pm

Hi @amrits , I was having the exact same issue on two machines (one laptop, one PC). I tested the state with the following code:

#include <stdio.h>
#include <cuda.h>

int main() {
    CUresult ret = cuInit(0);

    if (ret != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed! Error code: %d\n", ret);
        return 1;
    }

    printf("CUDA initialized successfully!\n");

    return 0;
}

Compiled and ran with this:

cc test_cuda.c -lcuda -I/opt/cuda/include && ./a.out

I get the following after I suspend once:

cuInit failed! Error code: 999

If I run sudo modprobe -r nvidia_uvm && sudo modprobe nvidua_uvm it does work. I just now enabled options nvidia NVreg_PreserveVideoMemoryAllocations=1 in /etc/modprobe/nvidia.conf and enabled nvidia-suspend.service and nvidia-hibernate.service (as per Arch wiki). That seems to have fixed my issue, I no longer need to unload and load nvidia_uvm anymore. Though I tested this only briefly, if things change I will let you know.

In case it helps, attached is an nvidia-bug-report from before I made those changes.
nvidia-bug-report.log.gz (1.8 MB)

amrits · November 20, 2023, 3:17pm

I was able to repro issue locally but wanted to check if anyone knows the last passing driver.

root@oemqa-Alienware-m17:~/cuda-samples/Samples/1_Utilities/bandwidthTest# ./bandwidthTest
[CUDA Bandwidth Test] - Starting…
Running on…

cudaGetDeviceProperties returned 999
→ unknown error
CUDA error at bandwidthTest.cu:256 code=999(cudaErrorUnknown) “cudaSetDevice(currentDevice)”

hansg91 · December 7, 2023, 9:18am

No clue when this last worked, I remember this being an issue for a long long time (years?).

Update from my side: since last post I haven’t needed to reload nvidia_uvm after suspending. Seems to work as intended. Hope it helps finding the root cause.

hansg91 · March 6, 2024, 8:58am

Update after multiple months, I still don’t see this issue anymore. I haven’t tried reverting the changes I made, but something seems to work ;)

xuancong84 · March 6, 2024, 9:42am

I am seeing this on the latest Nvidia drivers (up to version 530.30.02). However, it becomes less frequent. So there is some improvement. But the random bug remains there.

I think this one is tough bug because it seems to be hardware dependent. So far I encounter this issue only on one of my machines (a Dell laptop).

user38608 · April 7, 2024, 7:45pm

I also have the exact same bug, your Dell laptop isn’t an XPS machine by any chance. Hopefully if there is any developers that are willing to solve it I could provide logs.

nvidia2960 · June 1, 2024, 3:06pm

I have to reset my drivers with

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

every once in a while, I haven’t quite been able to pinpoint if there’s anything in particular that triggers the issue.

here’s my relevant nvidia-smi output
| NVIDIA-SMI 550.76 Driver Version: 550.76 CUDA Version: 12.4 |
| 0 NVIDIA GeForce RTX 4070

some possibly relevant system info
OS: EndeavourOS Linux x86_64
Kernel: 6.8.7-arch1-2
Uptime: 1 day, 16 hours, 7 mins
Packages: 1399 (pacman), 7 (flatpak)
Shell: bash 5.2.26
Resolution: 1920x1080
DE: Plasma 6.0.4
WM: KWin
Theme: [Plasma], Breeze-Dark [GTK2], Breeze [GTK3]
Icons: ePapirus-Dark [Plasma], ePapirus-Dark [GTK2/3]
Terminal: yakuake
CPU: 13th Gen Intel i5-13400F (16) @ 4.600GHz
GPU: NVIDIA GeForce RTX 4070
Memory: 19151MiB / 96374MiB

I keep long-running machine learning and crypto mining software running, and sometimes they fail suddenly. The commands above let me restart the software and they begin working fine again for a time

xuancong84 · June 3, 2024, 2:14am

In addition to this issue, now I also experience ~30%-chance random hard crash when resuming from suspend (system LED lights up, signaling system is resuming, but screen does not light up, keyboard/mouse not responding even with Ctrl+Alt+F4/Ctrl+Alt+Del, system dead-locked), when Nvidia driver is running. FYI, wake-up never failed after rmmod nvidia. Latest versions of Nvidia driver is pretty shitty.

My system info:
Ubuntu 22.04.4 LTS
Linux 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Nvidia Driver Version: 550.67 CUDA Version: 12.4
Machine: Gigabyte AERO 16 OLED BSF
KDE Plasma: 5.24.7

rallan_md · November 20, 2024, 11:19pm

I had this problem before, with my recent installation of driver 565 the problem seems to be gone.

My system info:

System: Kernel: 6.5.0-1mx-ahs-amd64 [6.5.3-1~mx23ahs] arch: x86_64 bits: 64 compiler: gcc v: 12.2.0 parameters: BOOT_IMAGE=/boot/vmlinuz-6.5.0-1mx-ahs-amd64 root=UUID=<filter> ro quiet splash Desktop: Xfce v: 4.18.1 tk: Gtk v: 3.24.36 info: xfce4-panel wm: xfwm v: 4.18.0 vt: 7 dm: LightDM v: 1.26.0 Distro: MX-23.4_ahs_x64 Libretto October 15 2023 base: Debian GNU/Linux 12 (bookworm) Machine: Type: Desktop Mobo: ASUSTeK model: PRIME A320M-K v: Rev X.0x serial: <superuser required> UEFI: American Megatrends v: 6231 date: 08/31/2024

Graphics: Device-1: NVIDIA GA106 [GeForce RTX 3060 Lite Hash Rate] vendor: Micro-Star MSI driver: nvidia v: 565.57.01 alternate: nouveau,nvidia_drm non-free: 530.xx+ status: current (as of 2023-03) arch: Ampere code: GAxxx process: TSMC n7 (7nm) built: 2020-22 pcie: gen: 3 speed: 8 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s ports: active: none off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 07:00.0 chip-ID: 10de:2504 class-ID: 0300

I have this python code (see below) at the start of my GPU-dependent python apps just in case. It keeps coming back servicecode “1”, a good sign. My cuda-dependent AI code is running as expected:

# with Linux, need to ensure the nvidia-uvm module is loaded every time
check_uvm = subprocess.run(["lsmod","|","grep","-q","nvidia_uvm"])

print("check_uvm.returncode before subprocess:",check_uvm.returncode)

if check_uvm.returncode == 0:
    subprocess.run(["echo", "nvidia_uvm module is not loaded. Loading module."])
    subprocess.run(["sudo", "modprobe", "nvidia_uvm"])
else:
    subprocess.run(["echo", "nvidia_uvm module is loaded."])
uvm_process = subprocess.run(["lsmod","|","grep","-q","nvidia_uvm"])
print("uvm_process.returncode after subprocess:",uvm_process.returncode)

Note: For some reason the 565 driver, as of this writing 2024-11-20T06:00:00Z, does not come with nvidia-smi. To monitor my GPU performance, I am now using the flatpack called “mission center”. The command “gpustat” is another approach.

xuancong84 · November 21, 2024, 3:46am

@rallan_md Thanks for reporting!

Since this is a persistent bug (it has been there for almost 8 years) that also randomly causes system to crash upon resuming from suspend, you also need to test many times with both short and long duration system suspend.

Topic		Replies	Views
/dev/nvidia-uvm IO error on Ubuntu 22.04, 520 to 535 driver versions Linux cuda , opencl , linux-driver	2	2979	August 27, 2023
Nvidia-uvm module bug on suspend Linux	14	1701	December 7, 2023
'No devices were found' after installing cuda 11.02 on Ubuntu 20.04 for RTX3080 Linux cuda , ubuntu , driver	19	12567	July 31, 2021
Nvidia-drm Failed to map when waking up on Ubuntu 23.10 GPU - Hardware ubuntu	8	1173	January 10, 2024
NVIDIA driver is not loaded. Ubuntu 18.10 Linux	310	129435	February 14, 2024
Ubuntu 22.04.1 Nvidia Driver (Open Kernel) Nvidia-Driver-515-Open Issue Linux kernel	14	22297	November 19, 2022
Black screen after install of nvidia driver ubuntu Linux	224	159635	February 27, 2025
[INFO]: Finished with code: 256 , [ERROR]: Install of driver component failed CUDA Setup and Installation	24	178063	September 29, 2024
NVidia driver 520.61.05 / Cuda 11.8 / RTX 3090 = black display and superslow modesets Linux cuda , ubuntu	21	23981	December 6, 2022
BLACK SCREEN at the reboot after install nVidia driver 375.26 Geforce GT630 Ubuntu 16.04 64bit... Linux	40	32488	April 21, 2017

BUG: nvidia_uvm needs to be removed and re-inserted in order to work after wakeup from suspend

Related topics