Dear developer, it seems that the latest nvidia-470 driver (470.63.01) has a minor glitch: after every wake-up from suspend (especially when there are active CUDA processes, either running or stopped at a breakpoint), I need to manually rmmod and modprobe the nvidia_uvm kernel module in order to run PyTorch/TensorFlow with CUDA support. The following terminal output illustrates the situation:
xuancong@wxc-dell:~$ ./Desktop/anaconda-python3 -c "import torch; print(torch.cuda.is_available())"
/opt/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     21812      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
I used the configuration below and ran bandwidthTest. I performed the suspend/resume operation multiple times but could not observe any need to re-insert nvidia_uvm.
Alienware Desktop + Ubuntu 20.04 LTS + Driver 470.63.01 + NVIDIA GeForce GTX 1080 Ti + CUDA 11.4
Please share reliable repro steps and an nvidia bug report.
Thanks @amrits for testing! The bug still exists on my Dell XPS laptop with Driver Version 510.60.02, CUDA Version 11.6.
There are a few things you need to test:
you need to test all sleep modes: s2idle, deep, etc. (run cat /sys/power/mem_sleep to check which is selected and to toggle it; see the snippet below)
you need to suspend while GPU/CUDA processes are running
during sleep, try plugging/unplugging the power adapter to trigger power events
the bug seems related to sleep duration: the longer the sleep, the higher the chance of it occurring, so suspend for at least 2 hours
Maybe the bug only affects Dell laptops, but you should test on a few different devices; one laptop is definitely not enough. Then you should be able to reproduce the bug.
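For reference, a minimal sketch for checking which sleep mode is currently selected (it only reads the standard /sys/power/mem_sleep interface; switching modes requires root):

from pathlib import Path

# Show the available sleep modes; the currently selected one is the entry
# wrapped in [brackets], e.g. "s2idle [deep]".
modes = Path("/sys/power/mem_sleep").read_text().split()
print("available:", [m.strip("[]") for m in modes])
print("selected: ", next((m.strip("[]") for m in modes if m.startswith("[")), None))

# Switching modes requires root, e.g.: echo s2idle | sudo tee /sys/power/mem_sleep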
This bug happens on multi-GPU machines with CUDA 12 and driver 525 or 528. I solved it by running sudo rmmod nvidia_uvm on the host. It looks like if 2 processes access this device file at the same time, the 1st one blocks it and the 2nd one hangs, and after that any call to CUDA gives an error for everybody. I thought it was related to power management, but it looks like it's not, because after a restart, if the nvidia_uvm file is there and 2 processes access it, I immediately get that error. This is on Ubuntu 20.04. So something is wrong with this. Hope that helps trace it further.
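In case it helps reproduce that, here is a rough probe of the two-process scenario (only a sketch: it assumes PyTorch is installed, and the 60-second timeout is arbitrary):

import subprocess

# Launch two processes that both initialize CUDA at (nearly) the same time,
# mimicking two clients opening nvidia_uvm concurrently right after resume.
probe = "import torch; torch.cuda.init(); print(torch.cuda.is_available())"
procs = [subprocess.Popen(["python3", "-c", probe]) for _ in range(2)]
for p in procs:
    try:
        p.wait(timeout=60)
        print("exit code:", p.returncode)
    except subprocess.TimeoutExpired:
        # If the reported hang occurs, this makes it visible instead of blocking forever.
        print("process appears hung while initializing CUDA")
        p.kill()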
The bug still exists in the latest versions of the Nvidia drivers up to today. Did you suspend while a CUDA process was running or stopped at a breakpoint? If I terminate all CUDA processes before suspending, there is no such issue.
I have filed bug 4343535 internally for tracking purposes.
Please share an nvidia bug report from the repro state so that I can match the configuration as closely as possible.
cc test_cuda.c -lcuda -I/opt/cuda/include && ./a.out
I get the following after I suspend once:
cuInit failed! Error code: 999
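(For anyone without the C source handy, a roughly equivalent check can be done from Python with ctypes; this is just a sketch, assuming libcuda.so.1 is on the library path:)

import ctypes

# Minimal equivalent of the cuInit() probe: load the CUDA driver library
# and call cuInit(0); a non-zero return code indicates the failure.
libcuda = ctypes.CDLL("libcuda.so.1")
rc = libcuda.cuInit(0)
print("cuInit returned", rc)  # 0 = CUDA_SUCCESS, 999 = CUDA_ERROR_UNKNOWN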
If I run sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm, it does work. I have just now set options nvidia NVreg_PreserveVideoMemoryAllocations=1 in /etc/modprobe.d/nvidia.conf and enabled nvidia-suspend.service and nvidia-hibernate.service (as per the Arch wiki). That seems to have fixed my issue; I no longer need to unload and reload nvidia_uvm. I have only tested this briefly, though; if things change, I will let you know.
In case it helps, attached is an nvidia-bug-report from before I made those changes. nvidia-bug-report.log.gz (1.8 MB)
No clue when this last worked; I remember this being an issue for a long, long time (years?).
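In case it is useful to anyone trying the same workaround, here is a small sketch to verify it actually took effect (it assumes the driver exposes /proc/driver/nvidia/params and that the nvidia systemd units are installed):

import subprocess
from pathlib import Path

# Check whether the running nvidia module picked up the option.
for line in Path("/proc/driver/nvidia/params").read_text().splitlines():
    if "PreserveVideoMemoryAllocations" in line:
        print(line.strip())  # expect "PreserveVideoMemoryAllocations: 1"

# Check that the suspend/hibernate helper services are enabled.
for unit in ("nvidia-suspend.service", "nvidia-hibernate.service"):
    state = subprocess.run(["systemctl", "is-enabled", unit],
                           capture_output=True, text=True)
    print(unit, "->", (state.stdout or state.stderr).strip())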
Update from my side: since my last post I haven't needed to reload nvidia_uvm after suspending. It seems to work as intended. Hope it helps with finding the root cause.
I am seeing this on the latest Nvidia drivers (up to version 530.30.02). However, it has become less frequent, so there is some improvement, but the random bug is still there.
I think this is a tough bug because it seems to be hardware dependent. So far I have encountered this issue on only one of my machines (a Dell laptop).
I also have the exact same bug. Your Dell laptop isn't an XPS machine by any chance, is it? Hopefully, if there are any developers willing to work on this, I can provide logs.
Some possibly relevant system info:
OS: EndeavourOS Linux x86_64
Kernel: 6.8.7-arch1-2
Uptime: 1 day, 16 hours, 7 mins
Packages: 1399 (pacman), 7 (flatpak)
Shell: bash 5.2.26
Resolution: 1920x1080
DE: Plasma 6.0.4
WM: KWin
Theme: [Plasma], Breeze-Dark [GTK2], Breeze [GTK3]
Icons: ePapirus-Dark [Plasma], ePapirus-Dark [GTK2/3]
Terminal: yakuake
CPU: 13th Gen Intel i5-13400F (16) @ 4.600GHz
GPU: NVIDIA GeForce RTX 4070
Memory: 19151MiB / 96374MiB
I keep long-running machine-learning and crypto-mining software running, and sometimes it fails suddenly. The commands above let me restart the software, and it starts working fine again for a while.
In addition to this issue, I now also experience a ~30%-chance random hard crash when resuming from suspend while the Nvidia driver is loaded: the system LED lights up, signaling the system is resuming, but the screen does not come on, the keyboard/mouse do not respond (not even to Ctrl+Alt+F4 / Ctrl+Alt+Del), and the system is dead-locked. FYI, wake-up has never failed after rmmod nvidia. The latest versions of the Nvidia driver are pretty shitty.
My system info:
Ubuntu 22.04.4 LTS
Linux 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Nvidia Driver Version: 550.67 CUDA Version: 12.4
Machine: Gigabyte AERO 16 OLED BSF
KDE Plasma: 5.24.7
I have this Python code (see below) at the start of my GPU-dependent Python apps, just in case. It keeps reporting that nvidia_uvm is loaded, which is a good sign, and my CUDA-dependent AI code runs as expected:
import subprocess

# On Linux, ensure the nvidia_uvm module is loaded before initializing CUDA.
# Note: the pipe must go through a shell; passing "|" as a list element does not work.
check_uvm = subprocess.run("lsmod | grep -q nvidia_uvm", shell=True)
print("check_uvm.returncode before reload:", check_uvm.returncode)
if check_uvm.returncode != 0:  # grep -q returns non-zero when the module is absent
    print("nvidia_uvm module is not loaded. Loading module.")
    subprocess.run(["sudo", "modprobe", "nvidia_uvm"])
else:
    print("nvidia_uvm module is loaded.")
uvm_process = subprocess.run("lsmod | grep -q nvidia_uvm", shell=True)
print("uvm_process.returncode after reload:", uvm_process.returncode)
Note: for some reason the 565 driver, as of this writing (2024-11-20T06:00:00Z), does not come with nvidia-smi. To monitor my GPU performance, I am now using the Flatpak called "Mission Center". The gpustat command is another approach.
Since this is a persistent bug (it has been around for almost 8 years) that also randomly causes the system to crash upon resuming from suspend, you also need to test many times, with both short and long suspend durations.
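For anyone who wants to automate that, here is a rough harness (only a sketch: it assumes rtcwake is available and run as root, and the durations and cycle count are arbitrary):

import random
import subprocess

# Repeatedly suspend for a random duration via rtcwake, then check whether
# CUDA still initializes after resume.
probe = "import torch; print(torch.cuda.is_available())"
for cycle in range(10):
    seconds = random.choice([120, 600, 7200])  # mix of short and long suspends
    subprocess.run(["sudo", "rtcwake", "-m", "mem", "-s", str(seconds)])
    result = subprocess.run(["python3", "-c", probe], capture_output=True, text=True)
    print(f"cycle {cycle}: slept {seconds}s ->", (result.stdout or result.stderr).strip())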