Driver 525.85.12 reports (-1)ul memory available?

I have an AC922 system with 4 V100-SXM2 GPUs installed, and the GPUs have rather suddenly stopped working altogether.

Device 0 reports what looks like an overflowed memory figure:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2…    Off  | 00000004:04:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W | 17592181850112Mi…    |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
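As a sanity check on the "overflow" reading, here is a minimal Python sketch of the arithmetic. It assumes the Memory-Usage figure is in MiB (the unit is truncated in the output) and that it is derived from an unsigned 64-bit byte counter; neither is confirmed anywhere in this thread. The names `REPORTED_MIB` and `as_signed_mib` are made up for illustration.

```python
# Value copied from the Memory-Usage column above. The unit is truncated
# there ("Mi…"); MiB is an assumption.
REPORTED_MIB = 17592181850112

MIB = 2**20
# Number of MiB that an unsigned 64-bit byte counter can span (2**44).
U64_SPAN_MIB = 2**64 // MIB


def as_signed_mib(reported_mib: int) -> int:
    """Reinterpret a MiB figure taken from a 64-bit byte counter as signed.

    Values in the upper half of the unsigned range are treated as negative
    numbers that wrapped around.
    """
    if reported_mib >= U64_SPAN_MIB // 2:
        return reported_mib - U64_SPAN_MIB
    return reported_mib


if __name__ == "__main__":
    signed = as_signed_mib(REPORTED_MIB)
    print(f"{REPORTED_MIB} MiB reinterpreted as signed: {signed} MiB")
```

Under those assumptions the figure works out to -4194304 MiB, so it sits just below the 64-bit wraparound point rather than being exactly -1. That is consistent with a negative value being printed as unsigned, but it does not by itself say where the bad value comes from.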

This has persisted despite device resets (nvidia-smi -r -i 0,1), across several drivers (from CUDA 11.4, 11.7 and 12.0), and across system resets and even cold power-off / power-on cycles.

There is no apparent way to get any useful error output. Even the most basic query utilities don’t work:

./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

The driver loads fine and there is nothing in dmesg to indicate that anything is wrong.

A similar “completely unresponsive but no error output” problem is occurring on a second 922, but without reporting -1 free memory, which gives me some hope that maybe this is not dying hardware?
nvidia-bug-report.log.gz (3.4 MB)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Done & attached

Two things I noticed:

  • the X server is enabled to start but crashes and restarts in a loop
  • nvidia-persistenced is not running

Please disable the X server and configure nvidia-persistenced to start on boot with root privileges and in persistence mode (i.e. run it without options).
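To confirm afterwards that the daemon really is up, a minimal Python sketch like the following scans /proc for it. The helper name `find_pids` is made up for illustration, and this only checks that a process with that name exists, not that it was started with the right options.

```python
import os

# Linux truncates the name in /proc/<pid>/comm to 15 characters, so
# "nvidia-persistenced" shows up there as "nvidia-persiste".
COMM_MAX = 15


def find_pids(name: str, proc_root: str = "/proc") -> list:
    """Return the sorted PIDs whose command name matches `name`.

    `name` is truncated to COMM_MAX characters before comparing, to match
    what the kernel stores in the comm file.
    """
    wanted = name[:COMM_MAX]
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no procfs at this path
    for entry in entries:
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == wanted:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return sorted(pids)


if __name__ == "__main__":
    pids = find_pids("nvidia-persistenced")
    if pids:
        print("nvidia-persistenced is running, PID(s):", pids)
    else:
        print("nvidia-persistenced is NOT running")
```

Run it once after boot; if it reports that the daemon is not running, the start-on-boot configuration did not take effect.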

It looks like starting persistenced fixed the problem and cleared the -1 free memory report. Thank you!

So is having persistenced running a requirement for nvlinked devices?

Among other situations, nvidia-persistenced is a hard dependency on Power9 systems in particular; the GPUs won’t work without it. IIRC, those systems also need a specific config change regarding udev and NUMA; please check the Power9-specific NVIDIA setup guides for it.
