I have an AC922 system with four V100-SXM2 GPUs installed that has rather suddenly decided they simply will not work.
Device 0 reports an overflowed (wrapped-around) amount of memory available:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W | 17592181850112Mi...  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
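For what it's worth, that figure looks less like real telemetry and more like a negative free-memory value wrapped into an unsigned 64-bit byte counter. A quick standalone check of the arithmetic (just a sketch; the constant is copied from the table above, and nothing here touches the GPU):

#include <cstdio>
#include <cstdint>

int main() {
    // Memory figure nvidia-smi reports for device 0, in MiB.
    uint64_t reported_mib = 17592181850112ULL;
    // Convert MiB to bytes; the product still fits in 64 bits.
    uint64_t reported_bytes = reported_mib << 20;
    // Reinterpret the same bit pattern as a signed byte count.
    int64_t as_signed = (int64_t)reported_bytes;
    printf("reported : %llu bytes\n", (unsigned long long)reported_bytes);
    printf("as signed: %lld bytes (%.1f TiB)\n",
           (long long)as_signed, (double)as_signed / (double)(1ULL << 40));
    return 0;
}

This prints -4398046511104 bytes (-4.0 TiB): a small negative quantity that has wrapped around, which would be consistent with something like free = total - used going negative and being stored unsigned.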
This has persisted despite device resets (nvidia-smi -r -i 0,1), across several driver versions (from CUDA 11.4, 11.7, and 12.0), across reboots, and even cold power-off / power-on cycles.
There is no apparent way to get any useful error output. Even the most basic query utilities don't work:
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
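For reference, return code 3 from cudaGetDeviceCount is cudaErrorInitializationError. The smallest probe I can think of for pulling raw error strings out of the runtime is below (a sketch using only standard CUDA runtime calls, nothing AC922-specific; build with nvcc probe.cu -o probe):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("cudaGetDeviceCount -> %d (%s: %s), count=%d\n",
           (int)err, cudaGetErrorName(err), cudaGetErrorString(err), n);
    if (err != cudaSuccess) return 1;

    // If initialization ever succeeds, cross-check the memory figures
    // that nvidia-smi is showing.
    for (int i = 0; i < n; ++i) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(i);
        err = cudaMemGetInfo(&free_b, &total_b);
        printf("device %d: cudaMemGetInfo -> %s, free=%zu, total=%zu bytes\n",
               i, cudaGetErrorName(err), free_b, total_b);
    }
    return 0;
}

I would expect this to die at the very first call with the same error 3 that deviceQuery shows, which is why there seems to be nothing more to get out of the runtime side.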
The driver loads fine and there is nothing in dmesg to indicate that anything is wrong.
A similar "completely unresponsive, but with no error output" problem is occurring on a second AC922, but without the -1 free memory report, which gives me some hope that maybe this is not dying hardware?
nvidia-bug-report.log.gz (3.4 MB)