We installed 8 Tesla P4 cards in our server. Last night something went wrong with our software, with the following error log:
nvidia-container-cli: initialization error: driver error: timed out: unknown
Then we rebooted the server. But when we use the “nvidia-smi” command to check the GPU status, it only shows 7 cards. We checked the PCIe devices with “lspci | grep -i nvidia”, and it showed all 8 NVIDIA GPU cards.
So I wonder: what is wrong with the missing GPU card, and how can I solve this problem?
You can try:
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID
With “CUDA_DEVICE_ORDER” set to PCI_BUS_ID, CUDA devices are enumerated in PCI bus ID order. The default is “FASTEST_FIRST” mode, which orders devices from fastest to slowest.
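For reference, here is a minimal sketch of how a CUDA application’s device enumeration can be inspected (assumptions on my part: the CUDA toolkit is installed, and the file name enum_devices.cu is just an example; build with “nvcc enum_devices.cu -o enum_devices”):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            // A failed or missing device can surface here as an init error.
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // With CUDA_DEVICE_ORDER=PCI_BUS_ID set, index i follows PCI bus
            // order (matching lspci) instead of the FASTEST_FIRST heuristic.
            printf("CUDA device %d: %s (PCI %04x:%02x:%02x.0)\n",
                   i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
        }
        return 0;
    }

Note that CUDA_DEVICE_ORDER only changes the order in which CUDA applications see the devices; it does not change which devices the driver itself detects.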
I tried this today, but it doesn’t work.
[root@localhost slxixiha]# lspci | grep -i nvidia
86:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:51:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@localhost slxixiha]# export CUDA_DEVICE_ORDER=PCI_BUS_ID
[root@localhost slxixiha]# nvidia-smi
Thu Mar 28 15:52:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
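Since lspci reports more cards than nvidia-smi, one way to cross-check what the driver itself enumerates is to query NVML directly. This is only a sketch, assuming the driver’s libnvidia-ml library is present (the file name nvml_list.c is just an example; build with “gcc nvml_list.c -o nvml_list -lnvidia-ml”):

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        nvmlReturn_t rc = nvmlInit();
        if (rc != NVML_SUCCESS) {
            fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
            return 1;
        }
        unsigned int count = 0;
        nvmlDeviceGetCount(&count);
        printf("Driver reports %u GPU(s)\n", count);
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlPciInfo_t pci;
            char name[NVML_DEVICE_NAME_BUFFER_SIZE];
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetName(dev, name, sizeof(name));
            // Print each device the driver exposes, with its PCI bus ID,
            // so the list can be compared against lspci.
            nvmlDeviceGetPciInfo(dev, &pci);
            printf("  %u: %s at %s\n", i, name, pci.busId);
        }
        nvmlShutdown();
        return 0;
    }

A bus ID that shows up in lspci but not in this list points at the driver or the board itself rather than at anything CUDA-level.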
I think your driver is not installed properly; I suggest you reinstall it.
Actually, I have already reinstalled the driver, but it still doesn’t work.
Can you find anything from the following output?
[root@localhost ~]# lsmod | grep nvidia
nvidia_uvm 790989 0
nvidia_drm 43787 0
nvidia_modeset 1036572 1 nvidia_drm
nvidia 16641689 56 nvidia_modeset,nvidia_uvm
ipmi_msghandler 46608 4 ipmi_ssif,ipmi_devintf,nvidia,ipmi_si
drm_kms_helper 159169 2 ast,nvidia_drm
drm 370825 5 ast,ttm,drm_kms_helper,nvidia_drm
i2c_core 40756 8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nvidia
I notice that the “i2c_core” entry doesn’t show up on another server. Does that matter?
Can you upload the log captured by the command “sudo nvidia-bug-report.sh”?
Has the machine passed NvQual?
Sorry, my colleagues thought something was wrong with that P4 card, so they replaced it with another card.
I have uploaded the log captured by “sudo nvidia-bug-report.sh”.
No, we haven’t run NvQual yet.
No, we haven’t run NvQual yet. ==> Without passing NvQual, any failure is expected. NVIDIA requires that P4/T4 cards be used only on machines that have passed NvQual.