We installed 8 Tesla P4 cards in our server. Last night something went wrong with our software, with the following error log:
nvidia-container-cli: initialization error: driver error: timed out: unknown
Then we rebooted the server. But when we use the “nvidia-smi” command to check the GPU status, it only shows 7 cards. We checked the PCIe devices with “lspci | grep -i nvidia”, and it showed all 8 NVIDIA GPU cards.
So I wonder: what is wrong with the missing GPU card, and how can I solve this problem?
You can try:
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID
With “CUDA_DEVICE_ORDER” set to PCI_BUS_ID, CUDA devices are enumerated in PCI bus ID order. The default is “FASTEST_FIRST” mode, which orders devices from fastest to slowest.
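For reference, here is a minimal sketch of how a CUDA application’s device enumeration can be inspected (assumptions on my part: the CUDA toolkit is installed, and the file name enum_devices.cu is just an example; build with “nvcc enum_devices.cu -o enum_devices”):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            // A failed or missing device can surface here as an init error.
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // With CUDA_DEVICE_ORDER=PCI_BUS_ID set, index i follows PCI bus
            // order (matching lspci) instead of the FASTEST_FIRST heuristic.
            printf("CUDA device %d: %s (PCI %04x:%02x:%02x.0)\n",
                   i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
        }
        return 0;
    }

Note that CUDA_DEVICE_ORDER only changes the order in which CUDA applications see the devices; it does not change which devices the driver itself detects.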
I tried this today, but it doesn’t work.
[root@localhost slxixiha]# lspci | grep -i nvidia
86:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:51:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@localhost slxixiha]# export CUDA_DEVICE_ORDER=PCI_BUS_ID
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:52:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
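Since lspci reports more cards than nvidia-smi, one way to cross-check what the driver itself enumerates is to query NVML directly. This is only a sketch, assuming the driver’s libnvidia-ml library is present (the file name nvml_list.c is just an example; build with “gcc nvml_list.c -o nvml_list -lnvidia-ml”):

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        nvmlReturn_t rc = nvmlInit();
        if (rc != NVML_SUCCESS) {
            fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
            return 1;
        }
        unsigned int count = 0;
        nvmlDeviceGetCount(&count);
        printf("Driver reports %u GPU(s)\n", count);
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlPciInfo_t pci;
            char name[NVML_DEVICE_NAME_BUFFER_SIZE];
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetName(dev, name, sizeof(name));
            // Print each device the driver exposes, with its PCI bus ID,
            // so the list can be compared against lspci.
            nvmlDeviceGetPciInfo(dev, &pci);
            printf("  %u: %s at %s\n", i, name, pci.busId);
        }
        nvmlShutdown();
        return 0;
    }

A bus ID that shows up in lspci but not in this list points at the driver or the board itself rather than at anything CUDA-level.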
I think your driver is not installed properly; I suggest you reinstall it.
Actually, I have already reinstalled the driver, but it still doesn’t work.
Can you find anything from the following output?
[root@localhost ~]# lsmod | grep nvidia
nvidia_uvm 790989 0
nvidia_drm 43787 0
nvidia_modeset 1036572 1 nvidia_drm
nvidia 16641689 56 nvidia_modeset,nvidia_uvm
ipmi_msghandler 46608 4 ipmi_ssif,ipmi_devintf,nvidia,ipmi_si
drm_kms_helper 159169 2 ast,nvidia_drm
drm 370825 5 ast,ttm,drm_kms_helper,nvidia_drm
i2c_core 40756 8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nvidia
I notice that the “i2c_core” entry doesn’t show up on another server. Does that matter?
Can you upload the log captured by the command “sudo nvidia-bug-report.sh”?
Has the machine passed NvQual?
Sorry, my colleagues thought something was wrong with that P4 card, so they replaced it with another card.
I have uploaded the log captured by “sudo nvidia-bug-report.sh”.
No, we haven’t run NvQual yet.
No, we haven’t run NvQual yet. ==> Without passing NvQual, any failure is expected. NVIDIA requires that P4/T4 cards be used only on machines that have passed NvQual.