BUG: nvidia_uvm needs to be removed and re-inserted in order to work after wakeup from suspend

Dear developers, it seems that the latest nvidia-470 (470.63.01) driver has a glitch: after every wake-up from suspend (especially when there are active CUDA processes either running or stopped at a breakpoint), I need to manually rmmod and modprobe the nvidia-uvm kernel module in order to run PyTorch/TensorFlow with CUDA support. The following terminal output shows the situation:

xuancong@wxc-dell:~$ ./Desktop/anaconda-python3 -c "import torch; print(torch.cuda.is_available())"
/opt/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False

xuancong@wxc-dell:~$ lsmod | grep nvidia
nvidia_uvm 1032192 0
nvidia_drm 61440 2
nvidia_modeset 1196032 2 nvidia_drm
nvidia 35266560 73 nvidia_uvm,nvidia_modeset
drm_kms_helper 200704 2 nvidia_drm,i915
drm 495616 22 drm_kms_helper,nvidia,nvidia_drm,i915
xuancong@wxc-dell:~$ sudo rmmod nvidia_uvm
xuancong@wxc-dell:~$ sudo modprobe nvidia_uvm
xuancong@wxc-dell:~$ ./Desktop/anaconda-python3 -c "import torch; print(torch.cuda.is_available())"
True
xuancong@wxc-dell:~$ nvidia-smi
Sat Sep 4 16:19:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   Off | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8    N/A /  N/A |      4MiB /  4042MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     21812      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+


I can confirm this on Ubuntu 20.04 with driver 510 as well. I am pretty sure it is worth filing a bug report.

I used the configuration below and ran bandwidthTest. I performed the suspend/resume operation multiple times but could not observe any such requirement to re-insert nvidia_uvm.
Alienware Desktop + Ubuntu 20.04 LTS + Driver 470.63.01 + NVIDIA GeForce GTX 1080 Ti + Cuda 11.4

Please share reliable repro steps and an nvidia-bug-report.

Thanks @amrits for testing! The bug still exists on my Dell XPS laptop with Driver Version: 510.60.02 CUDA Version: 11.6.
There are a few things you need to test:

  1. Test all sleep modes: s2idle, deep, etc. (run cat /sys/power/mem_sleep to check and toggle).
  2. Suspend while GPU/CUDA processes are running.
  3. During sleep, try plugging/unplugging the power supply to trigger power events.
  4. The bug seems related to sleep duration: the longer the sleep, the higher the chance of it occurring, so suspend for at least 2 hours (a sketch of these steps follows the list).

Maybe the bug is specific to Dell laptops, so you should test on a few different devices; one laptop is definitely not enough. Then you should be able to reproduce the bug.
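
A minimal sketch of the test loop I have in mind (rtcwake is just one convenient way to get a timed suspend; the 2-hour timer is only an example):

# Show the available sleep modes; the bracketed one is currently active
cat /sys/power/mem_sleep

# Switch to S3 ("deep") if supported; repeat the whole test with s2idle too
echo deep | sudo tee /sys/power/mem_sleep

# Start a CUDA workload (or stop one at a breakpoint), then suspend for ~2 hours
# using an RTC wake alarm
sudo rtcwake -m mem -s 7200

# After resume, check whether CUDA still initializes
python3 -c "import torch; print(torch.cuda.is_available())"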

Thanks!
Good luck!

This bug also happens on multi-GPU machines with CUDA 12 and driver 525 or 528. I solved it by running sudo rmmod nvidia_uvm on the host. It looks like if two processes access this device at the same time, the first one blocks it and the second one hangs, and after that any CUDA call returns an error for everybody. I thought it was related to power management, but apparently it is not: even after a restart, if the nvidia_uvm device is there and two processes access it, I immediately get that error. This is on Ubuntu 20.04, so something is wrong here. Hope that helps trace it further.
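
In case it helps with reproduction, this is roughly how I hit it; the torch one-liner is only borrowed from the first post as a convenient CUDA init check, and whether concurrency is really the trigger is just my guess:

# Launch two CUDA initializations at (roughly) the same time
python3 -c "import torch; print(torch.cuda.is_available())" &
python3 -c "import torch; print(torch.cuda.is_available())" &
wait

# Once the bug is triggered, every later CUDA call fails until the module is reloaded
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm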

The bug still exists on the latest versions of the Nvidia drivers as of today. Did you suspend while a CUDA process was running or stopped at a breakpoint? If I terminate all CUDA processes before suspending, there is no such issue.

I can confirm this happens to me on driver 525.125 on AMD64 Debian Sid kernel 6.5.0-2 with a GTX-1070, exactly as described after suspend.

nothing’s gonna happen if none of you provide the nvidia-bug-report

I have filed bug 4343535 internally for tracking purposes.
I would request that you share an nvidia-bug-report from the repro state so that I can match the configuration as closely as possible.

Hi,
Can someone please share an nvidia-bug-report from the repro state so that I can match the configuration as closely as possible?
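
For reference, the report can be generated from the repro state with the script that ships with the driver:

# Writes nvidia-bug-report.log.gz to the current directory
sudo nvidia-bug-report.sh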

Hi @amrits, I was having the exact same issue on two machines (one laptop, one PC). I tested the state with the following code:

#include <stdio.h>
#include <cuda.h>

int main() {
    // Initialize the CUDA driver API; in the broken state this returns
    // CUDA_ERROR_UNKNOWN (999) until nvidia_uvm is reloaded.
    CUresult ret = cuInit(0);

    if (ret != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed! Error code: %d\n", ret);
        return 1;
    }

    printf("CUDA initialized successfully!\n");

    return 0;
}

Compiled and ran with this:

cc test_cuda.c -lcuda -I/opt/cuda/include && ./a.out

I get the following after I suspend once:

cuInit failed! Error code: 999

If I run sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm, it does work. I have now enabled options nvidia NVreg_PreserveVideoMemoryAllocations=1 in /etc/modprobe.d/nvidia.conf and enabled nvidia-suspend.service and nvidia-hibernate.service (as per the Arch wiki). That seems to have fixed my issue; I no longer need to unload and reload nvidia_uvm. I have only tested this briefly, though; if things change I will let you know.
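
For completeness, roughly the commands involved (the file name under /etc/modprobe.d/ is my own choice, and the initramfs rebuild may or may not be needed depending on whether the nvidia modules are included in your initramfs):

# Preserve video memory allocations across suspend
echo "options nvidia NVreg_PreserveVideoMemoryAllocations=1" | sudo tee /etc/modprobe.d/nvidia.conf

# Enable the suspend/hibernate helper services that ship with the driver
sudo systemctl enable nvidia-suspend.service nvidia-hibernate.service

# Rebuild the initramfs if the nvidia modules are included in it
# (e.g. mkinitcpio -P on Arch, update-initramfs -u on Debian/Ubuntu)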

In case it helps, attached is an nvidia-bug-report from before I made those changes.
nvidia-bug-report.log.gz (1.8 MB)

I was able to repro the issue locally, but wanted to check if anyone knows the last passing driver.

root@oemqa-Alienware-m17:~/cuda-samples/Samples/1_Utilities/bandwidthTest# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

cudaGetDeviceProperties returned 999
-> unknown error
CUDA error at bandwidthTest.cu:256 code=999(cudaErrorUnknown) "cudaSetDevice(currentDevice)"

No clue when this last worked; I remember this being an issue for a long, long time (years?).

Update from my side: since my last post I haven't needed to reload nvidia_uvm after suspending. It seems to work as intended. Hope this helps in finding the root cause.

Update after multiple months: I still haven't seen this issue again. I haven't tried reverting the changes I made, but something seems to work ;)

I am still seeing this on the latest Nvidia drivers (up to version 530.30.02). However, it has become less frequent, so there is some improvement, but the random bug is still there.

I think this is a tough bug because it seems to be hardware-dependent. So far I have encountered this issue on only one of my machines (a Dell laptop).

I also have the exact same bug. Your Dell laptop isn't an XPS machine by any chance, is it? If there are any developers willing to solve it, I could provide logs.

I have to reset my drivers with

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

every once in a while. I haven't quite been able to pinpoint whether anything in particular triggers the issue.

here’s my relevant nvidia-smi output
| NVIDIA-SMI 550.76 Driver Version: 550.76 CUDA Version: 12.4 |
| 0 NVIDIA GeForce RTX 4070

some possibly relevant system info
OS: EndeavourOS Linux x86_64
Kernel: 6.8.7-arch1-2
Uptime: 1 day, 16 hours, 7 mins
Packages: 1399 (pacman), 7 (flatpak)
Shell: bash 5.2.26
Resolution: 1920x1080
DE: Plasma 6.0.4
WM: KWin
Theme: [Plasma], Breeze-Dark [GTK2], Breeze [GTK3]
Icons: ePapirus-Dark [Plasma], ePapirus-Dark [GTK2/3]
Terminal: yakuake
CPU: 13th Gen Intel i5-13400F (16) @ 4.600GHz
GPU: NVIDIA GeForce RTX 4070
Memory: 19151MiB / 96374MiB

I keep long-running machine learning and crypto-mining software running, and sometimes it fails suddenly. The commands above let me restart the software, and it works fine again for a while.
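
For what it's worth, a rough sketch of how that recovery could be automated; ./run_job.sh is only a placeholder for the actual workload, and the reset assumes nothing else is still holding nvidia_uvm when it runs (rmmod fails otherwise):

#!/bin/bash
# Restart a long-running CUDA job and reset nvidia_uvm whenever it dies.
while true; do
    ./run_job.sh                                      # placeholder: ML / mining workload
    echo "job exited; resetting nvidia_uvm before restarting"
    sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
    sleep 10
done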

In addition to this issue, I now also experience a ~30% chance of a random hard crash when resuming from suspend while the Nvidia driver is loaded (the system LED lights up, signaling that the system is resuming, but the screen does not light up and the keyboard/mouse do not respond, even to Ctrl+Alt+F4/Ctrl+Alt+Del; the system is dead-locked). FYI, wake-up has never failed after rmmod nvidia. The latest versions of the Nvidia driver are pretty shitty.

My system info:
Ubuntu 22.04.4 LTS
Linux 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Nvidia Driver Version: 550.67 CUDA Version: 12.4
Machine: Gigabyte AERO 16 OLED BSF
KDE Plasma: 5.24.7

I had this problem before; with my recent installation of driver 565, the problem seems to be gone.

My system info:

System: Kernel: 6.5.0-1mx-ahs-amd64 [6.5.3-1~mx23ahs] arch: x86_64 bits: 64 compiler: gcc v: 12.2.0 parameters: BOOT_IMAGE=/boot/vmlinuz-6.5.0-1mx-ahs-amd64 root=UUID=<filter> ro quiet splash
  Desktop: Xfce v: 4.18.1 tk: Gtk v: 3.24.36 info: xfce4-panel wm: xfwm v: 4.18.0 vt: 7 dm: LightDM v: 1.26.0
  Distro: MX-23.4_ahs_x64 Libretto October 15 2023 base: Debian GNU/Linux 12 (bookworm)
Machine: Type: Desktop Mobo: ASUSTeK model: PRIME A320M-K v: Rev X.0x serial: <superuser required> UEFI: American Megatrends v: 6231 date: 08/31/2024

Graphics: Device-1: NVIDIA GA106 [GeForce RTX 3060 Lite Hash Rate] vendor: Micro-Star MSI driver: nvidia v: 565.57.01 alternate: nouveau,nvidia_drm non-free: 530.xx+ status: current (as of 2023-03) arch: Ampere code: GAxxx process: TSMC n7 (7nm) built: 2020-22
  pcie: gen: 3 speed: 8 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s
  ports: active: none off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 07:00.0 chip-ID: 10de:2504 class-ID: 0300

I have this Python code (see below) at the start of my GPU-dependent Python apps, just in case. So far the check keeps reporting that nvidia_uvm is loaded, a good sign, and my CUDA-dependent AI code is running as expected:

import subprocess

# With Linux, ensure the nvidia_uvm module is loaded every time.
# grep -q exits 0 if the module shows up in lsmod, non-zero otherwise;
# the pipe requires shell=True.
check_uvm = subprocess.run("lsmod | grep -q nvidia_uvm", shell=True)

print("check_uvm.returncode before modprobe:", check_uvm.returncode)

if check_uvm.returncode != 0:
    print("nvidia_uvm module is not loaded. Loading module.")
    subprocess.run(["sudo", "modprobe", "nvidia_uvm"])
else:
    print("nvidia_uvm module is loaded.")

uvm_process = subprocess.run("lsmod | grep -q nvidia_uvm", shell=True)
print("uvm_process.returncode after modprobe:", uvm_process.returncode)

Note: for some reason the 565 driver, as of this writing (2024-11-20), does not come with nvidia-smi. To monitor my GPU performance, I am now using the Flatpak called "Mission Center". The command gpustat is another approach.

@rallan_md Thanks for reporting!

Since this is a persistent bug (it has been around for almost 8 years) that also randomly causes the system to crash upon resuming from suspend, you also need to test many times with both short and long suspend durations.
