nvidia-smi uses all RAM and swap

I am using Debian testing on a laptop with an NVIDIA GPU, and I've installed the NVIDIA drivers using the nvidia-driver-full package. I have made no other modifications. I am able to use the GPU with no problems, but when I run nvidia-smi, it uses all of the RAM and swap on my system. If I wait long enough, the system comes back to normal and the nvidia-smi info is displayed. nvidia-smi behaves normally if I run valgrind nvidia-smi, however.

Has anyone else experienced this?

Hello,

Thanks for posting this, I thought I was going crazy. I am experiencing the same thing - I think. Did you already find a solution?

I’m running Debian testing - fully up to date and have tried both the Debian nvidia packages as well as the latest packages from nvidia’s site. Both installs have the same results.

When I run nvidia-smi it quickly uses up all of my RAM (64 GB) and is then killed - I guess by the OOM killer. I do not have swap, so it can’t fill that up.

After crashing there is a backtrace and data dump into dmesg. Running with strace also crashes and creates a log. Running with valgrind runs normally, as you said.
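
In case it helps anyone gather the same data, the kernel-side backtrace and the OOM kill record can be pulled out afterwards with standard tools (nothing specific to this bug; adjust the number of context lines to taste):

$ sudo dmesg -T | grep -iA 30 nvidia-smi
$ journalctl -k -b | grep -iA 30 nvidia-smi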

I’m using a GeForce RTX 3070. I just got it, so I wasn’t sure if I had misconfigured something or if something was broken on either Debian’s part or Nvidia’s part.

I have not found a solution yet. I’ve been using a bash alias to call nvidia-smi using valgrind and I called it a day:

# nvidia-smi eats all ram/swap unless run with valgrind
# not sure what the problem is, but this is an easier solution
function run-nvidia-smi {
    valgrind nvidia-smi "$@" 2> /dev/null
}
alias nvidia-smi="run-nvidia-smi"
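
A caveat for anyone copying this: the alias only takes effect in interactive shells, and the 2> /dev/null also throws away nvidia-smi's own error messages. A variant that discards only valgrind's output instead (a sketch; --log-file is a standard valgrind option):

# Same workaround, but only valgrind's messages go to /dev/null;
# nvidia-smi's own stderr still reaches the terminal.
function run-nvidia-smi {
    valgrind --log-file=/dev/null nvidia-smi "$@"
}
alias nvidia-smi="run-nvidia-smi"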

Has anyone found a solution to this yet? It still seems to be affecting me on Debian testing.

For me the problem started occurring after upgrading my Ubuntu installation from Noble to Oracular 1-2 weeks ago.

Running strace gives the following indication of where the problem occurs:


$ strace -v -tt nvidia-smi
...
13:19:07.454027 connect(8, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
13:19:07.454059 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
13:19:07.454089 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
13:19:07.454115 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77c9f6800000
13:19:08.634910 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77bdf6800000
13:19:16.336842 +++ killed by SIGKILL +++
Killed
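
Incidentally, that second mmap of 51539607552 bytes is exactly 48 GiB, and the prlimit64 call just above it reports a hard RLIMIT_NOFILE of 1073741816, so the allocation works out to roughly 48 bytes per possible file descriptor. My guess (not confirmed) is that it scales with the open-files hard limit. A quick sanity check on this machine:

$ ulimit -Hn
1073741816
$ echo $((48 * 1024 * 1024 * 1024))
51539607552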

For some reason the problem can be worked around by blocking the socket connection to nvidia-persistenced:


$ sudo chmod o-w /var/run/nvidia-persistenced/socket
$ nvidia-smi
Mon Aug 26 13:30:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P0             27W /  285W |    1859MiB /  12282MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     20854      G   /usr/bin/gnome-shell                          703MiB |
|    0   N/A  N/A     23343      G   /usr/bin/Xwayland                             450MiB |
|    0   N/A  N/A     23415      G   ...nglingPtr --variations-seed-version        112MiB |
|    0   N/A  N/A     28718      G   /app/lib/firefox/firefox                      242MiB |
|    0   N/A  N/A     29032      G   /app/lib/thunderbird/thunderbird              113MiB |
+-----------------------------------------------------------------------------------------+
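
To inspect or undo the change later (plain coreutils; presumably the chmod gets reset whenever nvidia-persistenced recreates the socket, e.g. after a service restart or driver update):

$ stat -c '%A %U:%G %n' /var/run/nvidia-persistenced/socket   # check current mode and owner
$ sudo chmod o+w /var/run/nvidia-persistenced/socket          # revert the workaround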

Hi all,

Sorry, there is no update yet. I have not been able to reproduce this on my systems with Debian testing. Please capture an NVIDIA bug report after you hit this issue and attach it here. I will check this again and file a bug for tracking.

Thank you

Something tells me that you didn’t try at all

Hi @taco-bell-5-layer-burrito ,

I saw the following message [EDIT - in the kernel logs] on my Debian testing system during the last round of testing -

`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`

but nvidia-smi runs without getting killed.

I saw a momentary increase in memory usage on running nvidia-smi. It did not increase indefinitely. If there are any additional configuration changes required to reproduce this, please let me know.

Thank you

How are you getting this message? What version of Debian are you actually using? When I run strace -v -tt nvidia-smi as @developer.nvidia.com26 did, I get the same message and result, where all my RAM is being used by nvidia-smi.

__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation

Have you thought about what that error message means? That __vm_enough_memory stopped nvidia-smi from allocating that much memory because you didn't have enough? Maybe that is what causes nvidia-smi to give up on that ridiculous memory allocation and continue normally? Maybe you should look into that error message instead of posting it and shrugging your shoulders.
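
For reference, that message comes from the kernel's overcommit accounting, so whether the giant mapping is allowed or refused most likely depends on each system's overcommit settings - my guess, since I don't know how NVIDIA's test machines are configured. They are easy to compare between systems:

$ sysctl vm.overcommit_memory vm.overcommit_ratio   # 0 = heuristic, 1 = always allow, 2 = never overcommit
$ grep -i commit /proc/meminfo                      # CommitLimit / Committed_AS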

Hi all,

Thank you for the feedback. We will try this out on the latest Debian testing image. I have filed a bug to track this internally at NVBug #4833179. I will share Engineering feedback when available.

Thank you

Hello.

Ran into this myself. My system was slow and my RAM completely consumed. Finally realized it was nvidia-smi; another tool was calling it in a loop and causing RAM to flood.

To verify, I made sure the looping tool was dead, then ran nvidia-smi directly, and watched the RAM get consumed.

Wanna watch? https://www.youtube.com/watch?v=zU1gfNk4kH0

Best part (not shown): subsequent runs don’t use whatever nvidia-smi cached; it starts fresh, pegging a CPU core to fill all RAM.

zoey@Clippy:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.10
Release:	24.10
Codename:	oracular
zoey@Clippy:~$ nvidia-smi
Thu Oct 10 22:08:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   51C    P8             21W /  350W |      75MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    118489      G   /usr/bin/gnome-shell                           68MiB |
+-----------------------------------------------------------------------------------------+

Running sudo chmod o-w /var/run/nvidia-persistenced/socket as suggested by developer.nvidia.com26 above makes nvidia-smi instantly respond without consuming everything in sight. However, I severely dislike magical fixes and would love something that doesn’t require me to fiddle with files like this to work around bugs.

Same issue here on Debian testing. It’s been going on for quite some time. Persistence daemon is enabled. nvidia-smi 535.183.06 continues to try to map over 50GB of memory every time it runs. My machine has enough memory so it eventually runs as expected but the usage is ridiculous.

This was also reported already with no resolution and was automatically closed (cool): `nvidia-smi` Performance degredation

Note also that brittle things like chmod are just going to break again when there's a driver update. Better to just systemctl disable nvidia-persistenced so it sticks for the time being. Startup time for nvidia-smi will still be slower if the GPU is not in use by anything, since avoiding exactly that is the whole point of persistenced.
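
For example (assuming the unit is named nvidia-persistenced.service, which it is in the Debian/Ubuntu packaging as far as I know):

$ sudo systemctl disable --now nvidia-persistenced.service   # stop it and keep it off across reboots
$ sudo systemctl enable --now nvidia-persistenced.service    # re-enable once a fixed driver lands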

Same issue here - it was working with no problems on Ubuntu 24.04, but as soon as I upgraded to 24.10 it instantly eats up all CPU and memory (64 GB).

This bug also exists on Ubuntu 24.10 with driver 560.35.03 but was not a problem on Ubuntu 24.04 with driver 535.183.06 (which was used before the upgrade to 24.10). The current Linux kernel is 6.11.0.9.

Also reported on Ubuntu as Bug #2084987 “nvidia-smi is slow and has massive memory leak on ...” : Bugs : nvidia-graphics-drivers-560 package : Ubuntu

Hi all,

This was an issue with our nvidia-persistenced service on the latest Debian and Ubuntu test images. Engineering has identified the problem and submitted a fix. We have verified the fix on our systems.

Unfortunately, the fix is high-risk and requires a full QA test cycle at our end. This fix will be available in a future production branch.

Setting a lower limit for the maximum number of open file descriptors can also be used as a potential workaround until the fix is available:
# ulimit -Hn 16777216 or # ulimit -Hn 524288 instead of the current default value of 1073741816 (on Debian testing, Ubuntu 24.10 nightly).
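
One way to apply this per shell, sketched under the assumption that you only need it for interactive nvidia-smi runs (a non-root user can lower the hard limit but cannot raise it again, so start a fresh login to get the default back):

$ ulimit -Hn 524288   # lower the hard limit in this shell only
$ nvidia-smi

For something persistent you could also lower DefaultLimitNOFILE in /etc/systemd/system.conf or add an entry in /etc/security/limits.conf, but that affects everything on the system, so treat it as a stopgap.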

I will update this thread when the fix is available on a release driver.

Thank you

Hello everyone,

I was experiencing the same issue with wayland/hyprland/sddm and an NVIDIA P400.

Only the sudo chmod o-w /var/run/nvidia-persistenced/socket fix seemed to help; none of the other fixes in this thread seemed to do anything (even the ulimit fix was not effective, even when scaled down to match my memory size).

Originally, it was using all 32 GB of my RAM and 80 GB of my swap file.

My logs:
