Tesla P4 card disappears from “nvidia-smi” output

We installed 8 Tesla P4 cards on our server. But last night something went wrong with our software, with this error in the log:

nvidia-container-cli: initialization error: driver error: timed out

Then we rebooted the server. But when we used the “nvidia-smi” command to check GPU status, we found that it only shows 7 cards. We checked the PCIe devices with the command “lspci | grep -i nvidia”, and it showed 8 NVIDIA GPU cards.

So I wonder: what is wrong with the missing GPU card, and how can I solve this problem?
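For reference, here is a quick way to compare the number of cards the PCI bus reports with the number the driver enumerates (assuming this nvidia-smi supports the --query-gpu option, which drivers of this generation do):

$ lspci | grep -i nvidia | wc -l
$ nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | wc -l

On our server the first command counts 8 devices while the second lists only 7.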

You can try:
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID
With “CUDA_DEVICE_ORDER” set to PCI_BUS_ID, CUDA will enumerate devices in PCI bus ID order.

The default order is “FASTEST_FIRST” mode.
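For example, to run a CUDA application with devices enumerated in PCI bus ID order (./my_cuda_app is just a placeholder for your own binary):

$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 ./my_cuda_app

As far as I know this only affects the CUDA runtime’s device ordering; nvidia-smi itself queries the driver directly.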

I tried this today, but it doesn’t work.

[root@localhost slxixiha]# lspci | grep -i nvidia
86:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:51:45 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@localhost slxixiha]# export CUDA_DEVICE_ORDER=PCI_BUS_ID
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:52:05 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I think your driver is not installed properly; I suggest you reinstall it.
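Before reinstalling, it may also be worth checking that the loaded kernel module matches the installed user-space driver (just a quick sanity check; these are the usual locations on Linux):

$ cat /proc/driver/nvidia/version
$ modinfo nvidia | grep ^version

If the two versions disagree, a clean reinstall should resolve that.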

Actually, I have already reinstalled the driver, but it still doesn’t work.

Can you tell anything from the following output?

[root@localhost ~]# lsmod | grep nvidia
nvidia_uvm            790989  0 
nvidia_drm             43787  0 
nvidia_modeset       1036572  1 nvidia_drm
nvidia              16641689  56 nvidia_modeset,nvidia_uvm
ipmi_msghandler        46608  4 ipmi_ssif,ipmi_devintf,nvidia,ipmi_si
drm_kms_helper        159169  2 ast,nvidia_drm
drm                   370825  5 ast,ttm,drm_kms_helper,nvidia_drm
i2c_core               40756  8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nvidia

I notice that the “i2c_core” entry does not show up on another server. Does that matter?

Can you upload the log captured by the command “sudo nvidia-bug-report.sh”?
Has the machine passed NvQual?
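In the meantime, you can also check the kernel log for errors around the time the card disappeared, and inspect the PCI device itself (the grep patterns are only a suggestion, and af:00.0 is assumed to be the card that nvidia-smi does not list, as in your output above):

$ dmesg | grep -iE 'nvrm|xid'
$ lspci -vvv -s af:00.0

Xid or RmInitAdapter failures there would point to a hardware or initialization problem with that specific card.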

Sorry, my colleague thought something was wrong with that P4 card, so they replaced it with another card.

I have uploaded the log captured by the command “sudo nvidia-bug-report.sh”.

No, we have not used NvQual yet.

“No, we have not used NvQual yet.” ==> Without passing NvQual, any failure is expected. NVIDIA requires that P4/T4 cards be used only in machines that have passed NvQual.

Can you provide NvQual?