nvidia-smi uses all of the RAM and swap

I am using Debian testing on a laptop with an NVIDIA GPU, and I’ve installed the NVIDIA drivers using the nvidia-driver-full package. I have made no other modifications. I am able to use the GPU with no problems, but when I run nvidia-smi, it uses all of the RAM and swap on my system. If I wait long enough, the system comes back to normal and the nvidia-smi info is displayed. However, nvidia-smi behaves normally if I run it as valgrind nvidia-smi.

Has anyone else experienced this?

Hello,

Thanks for posting this; I thought I was going crazy. I am experiencing the same thing - I think. Did you already find a solution?

I’m running Debian testing, fully up to date, and have tried both the Debian NVIDIA packages and the latest packages from NVIDIA’s site. Both installs give the same result.

When I run nvidia-smi, it quickly uses up all of my RAM (64 GB) and is then killed, I guess by the OOM killer. I do not have swap, so it can’t fill that up.

After the crash there is a backtrace and data dump in dmesg. Running under strace also crashes and produces a log. Running under valgrind works normally, as you said.

I’m using a GeForce RTX 3070. I just got it, so I wasn’t sure if I had misconfigured something or if something was broken on either Debian’s part or NVIDIA’s part.

I have not found a solution yet. I’ve been using a bash alias to run nvidia-smi through valgrind and called it a day:

# nvidia-smi eats all ram/swap unless run with valgrind
# not sure what the problem is, but this is an easier solution
function run-nvidia-smi {
    valgrind nvidia-smi "$@" 2> /dev/null
}
alias nvidia-smi="run-nvidia-smi"
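
One small usage note on the wrapper above: if you ever need the unwrapped binary (for example, to reproduce the crash for a bug report), prefixing the command with a backslash bypasses the alias:

# skip the alias and run the real binary directly
$ \nvidia-smi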

Has anyone found a solution to this yet? It still seems to be affecting me on Debian testing.

For me, the problem started occurring after upgrading my Ubuntu installation from Noble to Oracular 1-2 weeks ago.

Running strace gives the following indication of where the problem occurs:


$ strace -v -tt nvidia-smi
...
13:19:07.454027 connect(8, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
13:19:07.454059 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
13:19:07.454089 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
13:19:07.454115 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77c9f6800000
13:19:08.634910 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77bdf6800000
13:19:16.336842 +++ killed by SIGKILL +++
Killed
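
For context, the two anonymous mmap sizes in that trace work out to 4 GiB and 48 GiB, which is easy to double-check with coreutils numfmt:

# prints 4.0G and 48G
$ numfmt --to=iec 4294967296 51539607552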

For some reason the problem can be worked around by blocking the socket connection to nvidia-persistenced:


$ sudo chmod o-w /var/run/nvidia-persistenced/socket
$ nvidia-smi
Mon Aug 26 13:30:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P0             27W /  285W |    1859MiB /  12282MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     20854      G   /usr/bin/gnome-shell                          703MiB |
|    0   N/A  N/A     23343      G   /usr/bin/Xwayland                             450MiB |
|    0   N/A  N/A     23415      G   ...nglingPtr --variations-seed-version        112MiB |
|    0   N/A  N/A     28718      G   /app/lib/firefox/firefox                      242MiB |
|    0   N/A  N/A     29032      G   /app/lib/thunderbird/thunderbird              113MiB |
+-----------------------------------------------------------------------------------------+
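
One caveat with this workaround: the socket lives under /var/run (a tmpfs on current systems), so nvidia-persistenced will recreate it with the default permissions whenever the daemon restarts, and the chmod presumably has to be reapplied. A quick way to check whether it is still in place:

# the 'other' write bit should be cleared, e.g. srwxrwxr-x rather than srwxrwxrwx
$ stat -c '%A %n' /var/run/nvidia-persistenced/socket
# reapply if it has been reset
$ sudo chmod o-w /var/run/nvidia-persistenced/socket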

Hi all,

Sorry, there is no update yet. I have not been able to reproduce this on my systems with Debian testing. Please capture an NVIDIA bug report after you hit this issue and attach it here. I will check this again and file a bug for tracking.
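
In case it helps, the driver ships a collection script for this; running it as root should leave a compressed report in the current directory (the exact filename can vary between driver versions):

# typically produces nvidia-bug-report.log.gz in the current directory
$ sudo nvidia-bug-report.sh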

Thank you

Something tells me that you didn’t try at all.

Hi @the.jonathan.yang ,

I saw the following message [EDIT - in the kernel logs] on my Debian testing system during the last round of testing:

`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`

but nvidia-smi runs without getting killed.

I saw a momentary increase in memory usage when running nvidia-smi. It did not increase indefinitely. If there are any additional configuration changes required to reproduce this, please let me know.

Thank you

How are you getting this message? What version of Debian are you actually using? When I run strace -v -tt nvidia-smi as @developer.nvidia.com26 did, I get the same output and the same result: all of my RAM gets used by nvidia-smi.

__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation

Have you thought about what that error message means? That __vm_enough_memory stopped nvidia-smi from allocating that much memory because your system didn’t have enough? Maybe that is what lets nvidia-smi abandon that ridiculous memory allocation and continue normally? Maybe you should look into that error message instead of posting it and shrugging your shoulders.
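
For what it’s worth, whether the kernel refuses a huge anonymous mapping up front (which is what the __vm_enough_memory message indicates) or grants it and lets the OOM killer step in later depends on the VM overcommit policy, so it might be worth comparing these settings between a machine that crashes and one that does not:

# 0 = heuristic overcommit (default), 1 = always allow, 2 = strict accounting
$ sysctl vm.overcommit_memory vm.overcommit_ratio
# current commit limit and total committed address space
$ grep -i commit /proc/meminfo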

Hi all,

Thank you for the feedback. We will try this out on the latest Debian testing image. I have filed a bug to track this internally at NVBug #4833179. I will share Engineering feedback when available.

Thank you