I am using Debian testing on a laptop with an NVIDIA GPU, and I’ve installed the NVIDIA drivers using the nvidia-driver-full package. I have made no other modifications. I can use the GPU with no problems, but when I run nvidia-smi, it consumes all of the RAM and swap on my system. If I wait long enough, the system returns to normal and the nvidia-smi output is displayed. However, nvidia-smi behaves normally if I run valgrind nvidia-smi.
Thanks for posting this, I thought I was going crazy. I am experiencing the same thing - I think. Did you already find a solution?
I’m running Debian testing, fully up to date, and have tried both the Debian NVIDIA packages and the latest packages from NVIDIA’s site. Both installs give the same results.
When I run nvidia-smi it quickly uses up all of my RAM (64 GB) and is then killed - I guess by the OOM killer. I do not have swap, so it can’t fill that up.
After crashing there is a backtrace and data dump into dmesg. Running with strace also crashes and creates a log. Running with valgrind runs normally, as you said.
I’m using a GeForce RTX 3070. I just got it, so I wasn’t sure if I had misconfigured something or if something was broken on either Debian’s part or Nvidia’s part.
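For completeness, the two invocations I’ve been comparing look like this (the strace flags match what is mentioned later in the thread; the -o filename is just an example of my own):
# crashes, but leaves a log of the syscalls leading up to the OOM kill
strace -v -tt -o nvidia-smi.strace nvidia-smi
# runs normally for some reason
valgrind nvidia-smi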
I have not found a solution yet. I’ve been using a bash alias to call nvidia-smi through valgrind and called it a day:
# nvidia-smi eats all ram/swap unless run with valgrind
# not sure what the problem is, but this is an easier solution
function run-nvidia-smi {
    valgrind nvidia-smi "$@" 2> /dev/null
}
alias nvidia-smi="run-nvidia-smi"
Sorry, there is no update yet. I have not been able to reproduce this on my systems with Debian testing. Please capture an NVIDIA bug report after you hit this issue and attach it here. I will check this again and file a bug for tracking.
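For reference, the bug report is normally captured with the nvidia-bug-report.sh script that ships with the driver, run as root right after reproducing the issue:
# produces nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh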
I saw the following message [EDIT - in the kernel logs] on my Debian testing system during the last round of testing -
`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`
but nvidia-smi runs without getting killed.
I saw a momentary increase in memory usage on running nvidia-smi. It did not increase indefinitely. If there are any additional configuration changes required to reproduce this, please let me know.
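If you want to check whether your system is logging the same message, grepping the kernel log should surface it (either command works on a systemd-based install):
# search the kernel ring buffer for the overcommit failure
sudo dmesg | grep __vm_enough_memory
# or search the kernel messages in the journal
journalctl -k | grep __vm_enough_memory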
How are you getting this message? What version of Debian are you actually using? When I run strace -v -tt nvidia-smi as @developer.nvidia.com26 did, I get the same message and the same result, with all of my RAM being consumed by nvidia-smi.
`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`
Have you thought about what that error message means? That __vm_enough_memory stopped the allocation because you didn’t have enough memory? Maybe that is what lets nvidia-smi give up on that ridiculous allocation and continue normally? Maybe you should look into that error message instead of posting it and shrugging your shoulders.
Thank you for the feedback. We will try this out on the latest Debian testing image. I have filed a bug to track this internally at NVBug #4833179. I will share Engineering feedback when available.
Ran into this myself. My system was slow and my RAM was completely consumed. I finally realized it was nvidia-smi; another tool was calling it in a loop and flooding RAM.
Best part (not shown): subsequent runs don’t reuse whatever nvidia-smi cached; each run starts fresh, pegging a CPU core while it fills all RAM again.
zoey@Clippy:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.10
Release: 24.10
Codename: oracular
zoey@Clippy:~$ nvidia-smi
Thu Oct 10 22:08:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 0% 51C P8 21W / 350W | 75MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 118489 G /usr/bin/gnome-shell 68MiB |
+-----------------------------------------------------------------------------------------+
Running sudo chmod o-w /var/run/nvidia-persistenced/socket as suggested by developer.nvidia.com26 above makes nvidia-smi instantly respond without consuming everything in sight. However, I severely dislike magical fixes and would love something that doesn’t require me to fiddle with files like this to work around bugs.
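For anyone else trying this, a quick before/after check of the socket permissions (path exactly as in the command above) is enough to see whether the workaround is in place:
# show current permissions on the persistence daemon socket
ls -l /var/run/nvidia-persistenced/socket
# remove world-write permission (the workaround quoted above)
sudo chmod o-w /var/run/nvidia-persistenced/socket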
Same issue here on Debian testing. It’s been going on for quite some time. Persistence daemon is enabled. nvidia-smi 535.183.06 continues to try to map over 50GB of memory every time it runs. My machine has enough memory so it eventually runs as expected but the usage is ridiculous.
Note also that brittle fixes like the chmod are just going to break again when there’s a driver update. Better to just systemctl disable nvidia-persistenced so it sticks for the time being; a sketch of that is below. Startup time for nvidia-smi will still be slower if nothing is using the GPU, since keeping the driver initialized is the whole point of persistenced.
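A minimal sketch, stopping the daemon now and keeping it disabled across reboots:
# stop the running daemon and prevent it from starting at boot
sudo systemctl disable --now nvidia-persistenced
# confirm it is inactive
systemctl status nvidia-persistenced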
This bug also exists on Ubuntu 24.10 with driver 560.35.03 but was not a problem on Ubuntu 24.04 with driver 535.183.06 (which was used before the upgrade to 24.10). The current Linux kernel is 6.11.0.9.
This was an issue with our nvidia-persistenced service on the latest Debian and Ubuntu test images. Engineering has identified the problem and submitted a fix. We have verified the fix on our systems.
Unfortunately, the fix is high-risk and requires a full QA test cycle at our end. This fix will be available in a future production branch.
Setting a lower limit on the maximum number of open file descriptors can also be used as a workaround until the fix is available: # ulimit -Hn 16777216 or # ulimit -Hn 524288 instead of the current default value of 1073741816 (on Debian testing and the Ubuntu 24.10 nightly).
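As a concrete sketch, the hard limit can be checked and lowered per shell before running nvidia-smi (the change only applies to that shell and its children, and a non-root user cannot raise it back afterwards):
# show the current hard limit on open file descriptors
ulimit -Hn
# lower it for this shell, then run nvidia-smi in the same shell
ulimit -Hn 524288
nvidia-smi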
I will update this thread when the fix is available on a release driver.
I was experiencing the same issue with Wayland/Hyprland/SDDM and an NVIDIA P400.
Only the sudo chmod o-w /var/run/nvidia-persistenced/socket fix seemed to help; none of the other fixes in this thread did anything (even the ulimit fix, scaled down to my memory size, was not effective).
Originally, nvidia-smi was using all 32 GB of my RAM and 80 GB of my swap file.