Nvidia command cannot see second GPU

Hi,

I have a weird problem with the second GPU attached to our workstation. When it was first installed, nvidia-smi could see both GPU cards and we could run PyTorch programs to train models. However, after pausing and relaunching the Python programs several times, nvidia-smi displayed ERROR for the second GPU card, and shortly afterwards the second GPU disappeared from the output entirely.
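For reference, this is roughly the kind of check we run from inside Python to see how many devices PyTorch itself reports (a minimal sketch, just for illustration):

import torch

# Print what PyTorch can see: CUDA availability, device count, and device names.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))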

We ran lspci | grep VGA, which gives:

03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

which shows two Quadro RTX 8000 cards.
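As another cross-check against lspci, the devices can also be enumerated through NVML from Python (a minimal sketch, assuming the nvidia-ml-py package is installed):

import pynvml  # provided by the nvidia-ml-py package

# Enumerate devices as the NVIDIA driver reports them, with PCI bus IDs,
# to compare against the lspci output above.
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print("NVML device count:", count)
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetName(handle),
          pynvml.nvmlDeviceGetPciInfo(handle).busId)
pynvml.nvmlShutdown()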

But with nvidia-smi, we got the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:D8:00.0 Off |                  Off |
| 33%   24C    P8     5W / 260W |      1MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We also observed that nvidia-smi became much slower than usual after the second card could no longer be detected.
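To narrow down where the time goes, one thing we can do is query each card individually by its PCI bus ID and time the call (a rough sketch; the bus IDs are taken from the lspci output above):

import subprocess
import time

# Query each GPU by PCI bus ID and time how long nvidia-smi takes to answer.
for bus_id in ["00000000:3B:00.0", "00000000:D8:00.0"]:
    start = time.time()
    result = subprocess.run(
        ["nvidia-smi", "-i", bus_id, "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    elapsed = time.time() - start
    print(bus_id, result.returncode,
          result.stdout.strip() or result.stderr.strip(),
          f"({elapsed:.1f}s)")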

Please see the attached bug report: nvidia-bug-report.log.gz (505.4 KB)

We also tried:

  • Reinstalling the system and following the installation instructions for CUDA Toolkit 11.7.
  • Unplugging and reseating the GPU cards.

Neither of these worked.

Please help us!