nvidia-smi uses all RAM and swap

I am using Debian testing on a laptop with an NVIDIA GPU, and I've installed the NVIDIA drivers using the nvidia-driver-full package. I have made no other modifications. I am able to use the GPU with no problems, but when I run nvidia-smi, it uses all of the RAM and swap on my system. If I wait long enough, the system comes back to normal and the nvidia-smi info is displayed. nvidia-smi behaves normally if I run valgrind nvidia-smi, however.

Has anyone else experienced this?

Hello,

Thanks for posting this, I thought I was going crazy. I am experiencing the same thing - I think. Did you already find a solution?

I’m running Debian testing - fully up to date and have tried both the Debian nvidia packages as well as the latest packages from nvidia’s site. Both installs have the same results.

When I run nvidia-smi it quickly uses up all of my RAM (64 GB) and is then killed - I guess by the OOM killer. I do not have swap, so it can’t fill that up.

After crashing there is a backtrace and data dump into dmesg. Running with strace also crashes and creates a log. Running with valgrind runs normally, as you said.
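
In case it helps anyone gather the same data, the kernel-side backtrace and the OOM kill record can be pulled out afterwards with standard tools (nothing specific to this bug; adjust the number of context lines to taste):

$ sudo dmesg -T | grep -iA 30 nvidia-smi
$ journalctl -k -b | grep -iA 30 nvidia-smi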

I’m using a GeForce RTX 3070. I just got it, so I wasn’t sure if I had misconfigured something or if something was broken on either Debian’s part or Nvidia’s part.

I have not found a solution yet. I’ve been using a bash alias to call nvidia-smi using valgrind and I called it a day:

# nvidia-smi eats all ram/swap unless run with valgrind
# not sure what the problem is, but this is an easier solution
function run-nvidia-smi {
    valgrind nvidia-smi "$@" 2> /dev/null
}
alias nvidia-smi="run-nvidia-smi"
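
A caveat for anyone copying this: the alias only takes effect in interactive shells, and the 2> /dev/null also throws away nvidia-smi's own error messages. A variant that discards only valgrind's output instead (a sketch; --log-file is a standard valgrind option):

# Same workaround, but only valgrind's messages go to /dev/null;
# nvidia-smi's own stderr still reaches the terminal.
function run-nvidia-smi {
    valgrind --log-file=/dev/null nvidia-smi "$@"
}
alias nvidia-smi="run-nvidia-smi"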

Has anyone found a solution to this yet? It still seems to be affecting me on Debian testing.

For me the problem started occurring after upgrading my Ubuntu installation from Noble to Oracular 1-2 weeks ago.

Running strace gives the following indication of where the problem occurs:


$ strace -v -tt nvidia-smi
...
13:19:07.454027 connect(8, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
13:19:07.454059 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
13:19:07.454089 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
13:19:07.454115 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77c9f6800000
13:19:08.634910 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77bdf6800000
13:19:16.336842 +++ killed by SIGKILL +++
Killed
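
Incidentally, that second mmap of 51539607552 bytes is exactly 48 GiB, and the prlimit64 call just above it reports a hard RLIMIT_NOFILE of 1073741816, so the allocation works out to roughly 48 bytes per possible file descriptor. My guess (not confirmed) is that it scales with the open-files hard limit. A quick sanity check on this machine:

$ ulimit -Hn
1073741816
$ echo $((48 * 1024 * 1024 * 1024))
51539607552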

For some reason the problem can be worked around by blocking the socket connection to nvidia-persistenced:


$ sudo chmod o-w /var/run/nvidia-persistenced/socket
$ nvidia-smi
Mon Aug 26 13:30:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P0             27W /  285W |    1859MiB /  12282MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     20854      G   /usr/bin/gnome-shell                          703MiB |
|    0   N/A  N/A     23343      G   /usr/bin/Xwayland                             450MiB |
|    0   N/A  N/A     23415      G   ...nglingPtr --variations-seed-version        112MiB |
|    0   N/A  N/A     28718      G   /app/lib/firefox/firefox                      242MiB |
|    0   N/A  N/A     29032      G   /app/lib/thunderbird/thunderbird              113MiB |
+-----------------------------------------------------------------------------------------+
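
To inspect or undo the change later (plain coreutils; presumably the chmod gets reset whenever nvidia-persistenced recreates the socket, e.g. after a service restart or driver update):

$ stat -c '%A %U:%G %n' /var/run/nvidia-persistenced/socket   # check current mode and owner
$ sudo chmod o+w /var/run/nvidia-persistenced/socket          # revert the workaround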

Hi all,

Sorry, there is no update yet. I have not been able to reproduce this on my systems with Debian testing. Please capture an NVIDIA bug report after you hit this issue and attach it here. I will check this again and file a bug for tracking.

Thank you

Something tells me that you didn’t try at all

Hi @taco-bell-5-layer-burrito ,

I saw the following message [EDIT - in the kernel logs] on my Debian testing system during the last round of testing -

`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`

but nvidia-smi runs without getting killed.

I saw a momentary increase in memory usage on running nvidia-smi. It did not increase indefinitely. If there are any additional configuration changes required to reproduce this, please let me know.

Thank you

How are you getting this message? What version of Debian are you actually using? When I run strace -v -tt nvidia-smi as @developer.nvidia.com26 did, I get the same message and result, where all my RAM is being used by nvidia-smi.

__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation

Have you thought about what that error message means? That __vm_enough_memory stopped nvidia-smi from allocating that much memory because you didn't have enough? Maybe that is what causes nvidia-smi to give up on that ridiculous memory allocation and continue normally? Maybe you should look into that error message instead of posting it and shrugging your shoulders.
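
For reference, that message comes from the kernel's overcommit accounting, so whether the giant mapping is allowed or refused most likely depends on each system's overcommit settings - my guess, since I don't know how NVIDIA's test machines are configured. They are easy to compare between systems:

$ sysctl vm.overcommit_memory vm.overcommit_ratio   # 0 = heuristic, 1 = always allow, 2 = never overcommit
$ grep -i commit /proc/meminfo                      # CommitLimit / Committed_AS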

Hi all,

Thank you for the feedback. We will try this out on the latest Debian testing image. I have filed a bug to track this internally at NVBug #4833179. I will share Engineering feedback when available.

Thank you

Hello.

Ran into this myself. My system was slow and my RAM completely consumed. Finally realized it was nvidia-smi; another tool was calling it in a loop and causing RAM to flood.

To verify, I made sure the looping tool was dead, then ran nvidia-smi directly, and watched the RAM get consumed.

Wanna watch? https://www.youtube.com/watch?v=zU1gfNk4kH0

Best part (not shown): subsequent runs don’t use whatever nvidia-smi cached; it starts fresh, pegging a CPU core to fill all RAM.

zoey@Clippy:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.10
Release:	24.10
Codename:	oracular
zoey@Clippy:~$ nvidia-smi
Thu Oct 10 22:08:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   51C    P8             21W /  350W |      75MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    118489      G   /usr/bin/gnome-shell                           68MiB |
+-----------------------------------------------------------------------------------------+

Running sudo chmod o-w /var/run/nvidia-persistenced/socket as suggested by developer.nvidia.com26 above makes nvidia-smi instantly respond without consuming everything in sight. However, I severely dislike magical fixes and would love something that doesn’t require me to fiddle with files like this to work around bugs.

Same issue here on Debian testing. It’s been going on for quite some time. Persistence daemon is enabled. nvidia-smi 535.183.06 continues to try to map over 50GB of memory every time it runs. My machine has enough memory so it eventually runs as expected but the usage is ridiculous.

This was also reported already with no resolution and was automatically closed (cool): `nvidia-smi` Performance degredation

Note also that brittle things like chmod are just going to break again when there's a driver update. Better to just systemctl disable nvidia-persistenced so it sticks for the time being. Startup time for nvidia-smi will still be slower if the GPU is not in use by anything, since avoiding exactly that is the whole point of persistenced.
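
For example (assuming the unit is named nvidia-persistenced.service, which it is in the Debian/Ubuntu packaging as far as I know):

$ sudo systemctl disable --now nvidia-persistenced.service   # stop it and keep it off across reboots
$ sudo systemctl enable --now nvidia-persistenced.service    # re-enable once a fixed driver lands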

Same issue here - it was working with no problems on Ubuntu 24.04, but as soon as I upgraded to 24.10 it instantly eats up all CPU and memory (64 GB).

This bug also exists on Ubuntu 24.10 with driver 560.35.03 but was not a problem on Ubuntu 24.04 with driver 535.183.06 (which was used before the upgrade to 24.10). The current Linux kernel is 6.11.0.9.

Also reported on Ubuntu as Bug #2084987 “nvidia-smi is slow and has massive memory leak on ...” : Bugs : nvidia-graphics-drivers-560 package : Ubuntu

Hi all,

This was an issue with our nvidia-persistenced service on the latest Debian and Ubuntu test images. Engineering has identified the problem and submitted a fix. We have verified the fix on our systems.

Unfortunately, the fix is high-risk and requires a full QA test cycle at our end. This fix will be available in a future production branch.

Setting a lower limit for the maximum number of open file descriptors can also be used as a potential workaround until the fix is available:
# ulimit -Hn 16777216 or # ulimit -Hn 524288 instead of the current default value of 1073741816 (on Debian testing, Ubuntu 24.10 nightly).
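
One way to apply this per shell, sketched under the assumption that you only need it for interactive nvidia-smi runs (a non-root user can lower the hard limit but cannot raise it again, so start a fresh login to get the default back):

$ ulimit -Hn 524288   # lower the hard limit in this shell only
$ nvidia-smi

For something persistent you could also lower DefaultLimitNOFILE in /etc/systemd/system.conf or add an entry in /etc/security/limits.conf, but that affects everything on the system, so treat it as a stopgap.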

I will update this thread when the fix is available on a release driver.

Thank you

Hello everyone,

I was experiencing the same issue with wayland/hyprland/sddm and an NVIDIA P400.

Only the sudo chmod o-w /var/run/nvidia-persistenced/socket fix seemed to help; none of the other fixes in this thread seemed to do anything (even the ulimit fix was not effective, even when scaled down to match my memory size).

Originally, it was using all 32 GB of my RAM and 80 GB of my swap file.

My logs:
