Hi,
I have a strange problem with the second GPU in our workstation. When it was first installed, nvidia-smi could see both GPU cards and we could run PyTorch programs to train models. However, after we paused and relaunched the Python programs several times, nvidia-smi displayed ERROR for the second GPU card, and subsequently the second GPU disappeared from its output altogether.
We ran lspci | grep VGA, which gives:
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
This shows that both Quadro RTX 8000 cards are still visible on the PCI bus.
However, nvidia-smi lists only one of them:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:D8:00.0 Off |                  Off |
| 33%   24C    P8     5W / 260W |      1MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
We also noticed that nvidia-smi became much slower than usual once the second card could no longer be detected.
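The mismatch between the two outputs can also be checked with a script. Below is a minimal sketch, assuming lspci and nvidia-smi are both on the PATH. The function names are hypothetical helpers made up for illustration; the only nvidia-smi options it relies on are --query-gpu=pci.bus_id and --format=csv,noheader.

```python
import re
import subprocess

# Matches the trailing "bus:device.function" part of a PCI address,
# e.g. "3b:00.0" in "3b:00.0" or "d8:00.0" in "00000000:d8:00.0".
PCI_RE = re.compile(r"([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])$")


def short_bus_id(token):
    """Return the lowercase bus:device.function suffix, or None if absent."""
    match = PCI_RE.search(token.strip().lower())
    return match.group(1) if match else None


def nvidia_ids_from_lspci(lspci_text):
    """Bus IDs of NVIDIA VGA controllers listed in plain lspci output."""
    ids = []
    for line in lspci_text.splitlines():
        if "VGA" in line and "NVIDIA" in line:
            bus_id = short_bus_id(line.split()[0])
            if bus_id:
                ids.append(bus_id)
    return ids


def ids_from_nvidia_smi(smi_text):
    """Bus IDs from 'nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader'."""
    ids = []
    for line in smi_text.splitlines():
        bus_id = short_bus_id(line)
        if bus_id:  # skips blank lines and anything that is not a bus ID
            ids.append(bus_id)
    return ids


def missing_gpus(lspci_text, smi_text):
    """NVIDIA cards present on the PCI bus but not reported by the driver."""
    seen = set(ids_from_nvidia_smi(smi_text))
    return [b for b in nvidia_ids_from_lspci(lspci_text) if b not in seen]


if __name__ == "__main__":
    lspci_out = subprocess.run(
        ["lspci"], capture_output=True, text=True
    ).stdout
    smi_out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout
    print("On the PCI bus but missing from nvidia-smi:",
          missing_gpus(lspci_out, smi_out))
```

Fed the two outputs shown above, this sketch would report the card at 3b:00.0 as the one the driver no longer lists, since nvidia-smi only shows the card at D8:00.0.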
Please see the attached bug report: nvidia-bug-report.log.gz (505.4 KB)
We also tried:
- Reinstalling the system and following the installation instructions for CUDA Toolkit 11.7.
- Unplugging and reinstalling the GPU cards.
Neither of these worked.
Please help us!