I am using Debian testing on a laptop with an NVIDIA GPU, and I’ve installed the NVIDIA drivers using the nvidia-driver-full package. I have made no other modifications. I am able to use the GPU with no problems, but when I run nvidia-smi, it uses all of the RAM and swap on my system. If I wait long enough, the system comes back to normal and the nvidia-smi info is displayed. However, nvidia-smi acts normally if I run it as valgrind nvidia-smi.
Has anyone else experienced this?
Hello,
Thanks for posting this, I thought I was going crazy. I am experiencing the same thing - I think. Did you already find a solution?
I’m running Debian testing - fully up to date and have tried both the Debian nvidia packages as well as the latest packages from nvidia’s site. Both installs have the same results.
When I run nvidia-smi it quickly uses up all of my RAM (64 GB) and is then killed - I guess by the OOM killer. I do not have swap, so it can’t fill that up.
After crashing there is a backtrace and data dump into dmesg. Running with strace also crashes and creates a log. Running with valgrind runs normally, as you said.
I’m using a GeForce RTX 3070. I just got it, so I wasn’t sure if I had misconfigured something or if something was broken on either Debian’s part or Nvidia’s part.
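For anyone else hitting this, the OOM kill details can be pulled back out of the kernel log afterwards; a rough sketch, using either plain dmesg or the systemd journal:
# show recent OOM-killer activity from the kernel log
sudo dmesg | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20
# or, equivalently, via the systemd journal
sudo journalctl -k | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20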
I have not found a solution yet. I’ve been using a bash alias to call nvidia-smi via valgrind and I called it a day:
# nvidia-smi eats all ram/swap unless run with valgrind
# not sure what the problem is, but this is an easier solution
function run-nvidia-smi {
    valgrind nvidia-smi "$@" 2> /dev/null
}
alias nvidia-smi="run-nvidia-smi"
Has anyone found a solution to this yet? It still seems to be affecting me on Debian testing.
For me the problem started occurring after upgrading my Ubuntu installation from Noble to Oracular one to two weeks ago.
Running strace gives the following indication of where the problem occurs:
$ strace -v -tt nvidia-smi
...
13:19:07.454027 connect(8, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
13:19:07.454059 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
13:19:07.454089 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
13:19:07.454115 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77c9f6800000
13:19:08.634910 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77bdf6800000
13:19:16.336842 +++ killed by SIGKILL +++
Killed
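For scale, those two anonymous mmap calls are requesting 4 GiB and 48 GiB of memory; presumably the second one is what runs the machine out of memory before the SIGKILL:
$ echo "$((4294967296 / 1024**3)) GiB, $((51539607552 / 1024**3)) GiB"
4 GiB, 48 GiB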
For some reason the problem can be worked around by blocking the socket connection to nvidia-persistenced:
$ sudo chmod o-w /var/run/nvidia-persistenced/socket
$ nvidia-smi
Mon Aug 26 13:30:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P0 27W / 285W | 1859MiB / 12282MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 20854 G /usr/bin/gnome-shell 703MiB |
| 0 N/A N/A 23343 G /usr/bin/Xwayland 450MiB |
| 0 N/A N/A 23415 G ...nglingPtr --variations-seed-version 112MiB |
| 0 N/A N/A 28718 G /app/lib/firefox/firefox 242MiB |
| 0 N/A N/A 29032 G /app/lib/thunderbird/thunderbird 113MiB |
+-----------------------------------------------------------------------------------------+
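If you want that chmod workaround to survive restarts of nvidia-persistenced (the daemon recreates its socket when it starts), a systemd drop-in that re-applies the permission change should do it. This is only a sketch; it assumes the unit is named nvidia-persistenced.service and that the socket path matches the one above:
# drop-in so the chmod is re-applied whenever the daemon (re)starts
sudo mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
sudo tee /etc/systemd/system/nvidia-persistenced.service.d/socket-perms.conf <<'EOF'
[Service]
ExecStartPost=/bin/chmod o-w /var/run/nvidia-persistenced/socket
EOF
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced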
Hi all,
Sorry, there is no update yet. I have not been able to reproduce this on my systems with Debian testing. Please capture an NVIDIA bug report after you hit this issue and attach it here. I will check this again and file a bug for tracking.
Thank you
Something tells me that you didn’t try at all
Hi @the.jonathan.yang ,
I saw the following message [EDIT - in the kernel logs] on my Debian testing system during the last round of testing -
`__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation`
but nvidia-smi runs without getting killed.
I saw a momentary increase in memory usage when running nvidia-smi, but it did not increase indefinitely. If there are any additional configuration changes required to reproduce this, please let me know.
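In case it helps, one way to quantify that momentary increase is GNU time’s peak-RSS report (the /usr/bin/time binary from the time package, not the shell builtin). Note this only measures resident memory that actually gets touched, not the size of a failed mapping:
# "Maximum resident set size" in the report is the peak RSS, in KiB
/usr/bin/time -v nvidia-smi > /dev/null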
Thank you
How are you getting this message? What version of Debian are you actually using? When I run strace -v -tt nvidia-smi as @developer.nvidia.com26 did, I get the same message and result, where all my RAM is being used by nvidia-smi.
__vm_enough_memory: pid: 1326, comm: nvidia-smi, bytes: 51539742720 not enough memory for the allocation
Have you thought about what that error message means? That vm_enough_memory stopped you from allocating that much memory because you didn’t have enough? Maybe that would cause nvidia-smi to stop that ridiculous memory allocation and continue normally? Maybe you should look into that error message instead of posting it and shrugging your shoulders.
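For anyone following along: __vm_enough_memory is the kernel’s overcommit accounting check, so whether that 48 GiB request gets refused up front (message logged, nvidia-smi carries on) or granted and then filled until the OOM killer steps in likely comes down to the overcommit policy and how much RAM plus swap the machine has. A quick way to compare the two setups:
# 0 = heuristic overcommit (default), 1 = always allow, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# how much memory is currently committed vs. the commit limit
grep -E 'CommitLimit|Committed_AS' /proc/meminfo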
Hi all,
Thank you for the feedback. We will try this out on the latest Debian testing image. I have filed a bug to track this internally at NVBug #4833179. I will share Engineering feedback when available.