Hi,
I’m new to installing CUDA.
We successfully installed CUDA 12.1 on Ubuntu 22.04 on a machine with 6 A800 GPUs. Right after the installation, nvidia-smi showed all 6 GPUs, but after we ran a few samples it showed only 5. The 2nd GPU was missing.
We reinstalled CUDA and nvidia-smi showed 6 GPUs again. Then we ran some samples and it dropped to 5 again, but this time the 3rd GPU is missing (bus ID 1B:00.0 does not appear in the nvidia-smi output). The logs are as follows:
nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe           Off| 00000000:19:00.0 Off |                    0 |
| N/A   47C    P0               79W / 300W|  63591MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe           Off| 00000000:1A:00.0 Off |                    0 |
| N/A   47C    P0               76W / 300W|  63591MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80GB PCIe           Off| 00000000:1C:00.0 Off |                    0 |
| N/A   40C    P0               59W / 300W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80GB PCIe           Off| 00000000:B3:00.0 Off |                    0 |
| N/A   43C    P0               60W / 300W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80GB PCIe           Off| 00000000:B6:00.0 Off |                    0 |
| N/A   42C    P0               56W / 300W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     66721      C   python                                    63588MiB |
|    1   N/A  N/A     66721      C   python                                    63588MiB |
+---------------------------------------------------------------------------------------+
lspci, however, always shows all 6 GPUs:
lspci -nnv | grep -i nvidia
0000:19:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:1a:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:1b:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:1c:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:b3:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:b6:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f5] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1799]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
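To pin down which device drops out, the PCI addresses from lspci can be compared with the ones nvidia-smi reports. Below is a minimal Python sketch of that comparison, not something taken from the bug report. The helper names (`normalize`, `missing_gpus`, `lspci_gpu_ids`, `smi_gpu_ids`) are hypothetical, and it assumes `lspci` and `nvidia-smi` are on the PATH and that the GPUs show up in lspci as "3D controller" or "VGA compatible controller" devices.

```python
import subprocess


def normalize(bus_id: str) -> str:
    """Lowercase a PCI address and trim the domain to 4 hex digits.

    nvidia-smi prints an 8-digit domain (00000000:1A:00.0) while
    lspci prints a 4-digit one (0000:1a:00.0).
    """
    domain, rest = bus_id.strip().lower().split(":", 1)
    return f"{domain[-4:]}:{rest}"


def missing_gpus(lspci_ids, smi_ids):
    """Return addresses present on the PCI bus but not reported by the driver."""
    visible = {normalize(b) for b in smi_ids}
    return sorted(
        normalize(b) for b in lspci_ids if normalize(b) not in visible
    )


def lspci_gpu_ids():
    """List NVIDIA GPU addresses from lspci (assumption: lspci is installed)."""
    # -D always prints the PCI domain; -d 10de: selects NVIDIA's vendor ID.
    out = subprocess.run(
        ["lspci", "-D", "-d", "10de:"],
        capture_output=True, text=True, check=True,
    ).stdout
    ids = []
    for line in out.splitlines():
        lower = line.lower()
        # Keep GPU functions only; skip e.g. audio functions or bridges.
        if "3d controller" in lower or "vga compatible controller" in lower:
            ids.append(line.split()[0])
    return ids


def smi_gpu_ids():
    """List GPU addresses the driver currently exposes through nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On the affected machine, `missing_gpus(lspci_gpu_ids(), smi_gpu_ids())` would list the GPUs that dropped out. Fed with the addresses from the two listings in this post, it returns `['0000:1b:00.0']`, which matches the 3rd GPU being absent from nvidia-smi.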
I have attached the bug report log:
nvidia-bug-report.log.gz (6.4 MB)