nvidia-smi and /dev/nvidia* do not match

nvidia-smi shows 7 GPU cards:

$ nvidia-smi
Thu Aug 24 10:23:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 0000:04:00.0     Off |                    0 |
| N/A   35C    P0    53W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 0000:05:00.0     Off |                    0 |
| N/A   28C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 0000:06:00.0     Off |                    0 |
| N/A   33C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 0000:07:00.0     Off |                    0 |
| N/A   35C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 0000:0B:00.0     Off |                    0 |
| N/A   31C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 0000:0C:00.0     Off |                    0 |
| N/A   29C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 0000:0E:00.0     Off |                    0 |
| N/A   27C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |

But I have 8 GPU cards on the host:

$ ls /dev/nvidia*
/dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3  /dev/nvidia4  /dev/nvidia5  /dev/nvidia6  /dev/nvidia7  /dev/nvidiactl  /dev/nvidia-uvm
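A quick way to cross-check the mismatch is to compare the count of NVIDIA devices on the PCI bus with the count the driver actually brought up (a small sketch, assuming pciutils is installed; nvidia-smi -L lists only the GPUs that initialized successfully):

$ lspci | grep -ci nvidia    # NVIDIA devices the kernel sees on the PCI bus
$ nvidia-smi -L | wc -l      # GPUs the driver initialized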

What sort of system are the 8 P40 GPUs installed in?

What is the result of:

dmesg | grep NVRM

I would say that some of the most likely possibilities are:

  • overheating of the 8th GPU
  • inadequate power delivery to the 8th GPU
  • failure of the system BIOS to provide appropriate resource mapping for the 8th GPU (a quick check for this is sketched below)
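
The last possibility in particular tends to leave traces in the kernel log when the BIOS cannot map the card's BARs (a rough sketch; exact message wording varies by kernel version, and the bus ID below is a placeholder for whichever NVIDIA device is missing from nvidia-smi):

$ dmesg | grep -iE 'no space for|failed to assign'    # PCI BAR assignment failures, if any
$ lspci -vv -s <bus_id_of_missing_GPU>                # look for Memory regions that were never assigned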

Hi @txbob,

Sorry for the late reply.

OS: CentOS 7.2, kernel: 4.4.79-1.el7.elrepo.x86_64

dmesg output:

[2258112.560811] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[2258112.561073] NVRM: rm_init_adapter failed for device bearing minor number 6
[2258124.663702] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[2258124.663797] NVRM: rm_init_adapter failed for device bearing minor number 6
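
For reference, the failing minor number can usually be tied back to a physical slot by comparing the NVIDIA devices on the PCI bus with the bus IDs nvidia-smi reports; whichever bus ID appears in the first list but not the second is the card that fails RmInitAdapter (a sketch; 10de is NVIDIA's PCI vendor ID, and the two tools format bus IDs slightly differently, so compare by eye):

$ lspci -d 10de: | awk '{print $1}'                          # every NVIDIA device on the bus
$ nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader    # the seven GPUs that did initialize

Once that slot is identified, it is the place to start checking the cooling, power cabling, and BIOS resource assignment that txbob listed above.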