I have an AC922 system with four V100-SXM2 GPUs installed that has rather suddenly decided they simply will not work.
Device 0 reports an overflowed (wrapped-around) amount of memory available:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W | 17592181850112Mi...  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
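For what it's worth, that figure looks less like real telemetry and more like a negative free-memory value wrapped into an unsigned 64-bit byte counter. A quick standalone check of the arithmetic (just a sketch; the constant is copied from the table above, and nothing here touches the GPU):

#include <cstdio>
#include <cstdint>

int main() {
    // Memory figure nvidia-smi reports for device 0, in MiB.
    uint64_t reported_mib = 17592181850112ULL;
    // Convert MiB to bytes; the product still fits in 64 bits.
    uint64_t reported_bytes = reported_mib << 20;
    // Reinterpret the same bit pattern as a signed byte count.
    int64_t as_signed = (int64_t)reported_bytes;
    printf("reported : %llu bytes\n", (unsigned long long)reported_bytes);
    printf("as signed: %lld bytes (%.1f TiB)\n",
           (long long)as_signed, (double)as_signed / (double)(1ULL << 40));
    return 0;
}

This prints -4398046511104 bytes (-4.0 TiB): a small negative quantity that has wrapped around, which would be consistent with something like free = total - used going negative and being stored unsigned.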
This has persisted despite device resets (nvidia-smi -r -i 0,1), across several driver versions (from CUDA 11.4, 11.7, and 12.0), across reboots, and even cold power-off / power-on cycles.
There is no apparent way to get any useful error output. Even the most basic query utilities don't work:
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
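For reference, return code 3 from cudaGetDeviceCount is cudaErrorInitializationError. The smallest probe I can think of for pulling raw error strings out of the runtime is below (a sketch using only standard CUDA runtime calls, nothing AC922-specific; build with nvcc probe.cu -o probe):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("cudaGetDeviceCount -> %d (%s: %s), count=%d\n",
           (int)err, cudaGetErrorName(err), cudaGetErrorString(err), n);
    if (err != cudaSuccess) return 1;

    // If initialization ever succeeds, cross-check the memory figures
    // that nvidia-smi is showing.
    for (int i = 0; i < n; ++i) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(i);
        err = cudaMemGetInfo(&free_b, &total_b);
        printf("device %d: cudaMemGetInfo -> %s, free=%zu, total=%zu bytes\n",
               i, cudaGetErrorName(err), free_b, total_b);
    }
    return 0;
}

I would expect this to die at the very first call with the same error 3 that deviceQuery shows, which is why there seems to be nothing more to get out of the runtime side.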
The driver loads fine and there is nothing in dmesg to indicate that anything is wrong.
A similar "completely unresponsive, but with no error output" problem is occurring on a second AC922, but without the -1 free memory report, which gives me some hope that maybe this is not dying hardware?
nvidia-bug-report.log.gz (3.4 MB)