Hi,
I have an Ubuntu 16.04 system and installed CUDA 8.0 and driver 375.82 for two GTX 1080 Ti cards.
But nvidia-smi only detects one card, as follows:
$ nvidia-smi
Fri Aug 4 16:26:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   31C    P0    60W / 360W |      0MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ lspci|grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
It seems the system can detect both GPU cards, but the driver only detects one. Any ideas? Thanks!
# “dmesg|grep NVRM” shows nothing, with or without sudo:
$ dmesg|grep NVRM
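For anyone who wants to script the lspci vs. nvidia-smi comparison described above, here is a minimal Python sketch. It is not from the original posts. The helper names are made up for illustration, the sample strings are copied from the outputs above, and it assumes `lspci` was run without `-D` (so no PCI domain prefix on each line).

```python
import re

# Sample text copied from the thread; on a live system these would come from
# running `lspci` and reading the Bus-Id column of `nvidia-smi`.
LSPCI_OUTPUT = """\
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
"""

NVIDIA_SMI_BUS_IDS = ["0000:01:00.0"]  # what the driver reported in the first table

# Assumption: lspci lines start with bus:device.function (no domain prefix).
SLOT = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+VGA compatible controller: NVIDIA")


def pci_nvidia_vga_slots(lspci_text):
    """Return the slots of all NVIDIA VGA controllers listed by lspci."""
    slots = []
    for line in lspci_text.splitlines():
        match = SLOT.match(line)
        if match:
            slots.append(match.group(1))
    return slots


def strip_domain(bus_id):
    """Drop the leading PCI domain ('0000:' or '00000000:') and lowercase."""
    return bus_id.lower().split(":", 1)[1]


def missing_from_driver(lspci_text, smi_bus_ids):
    """Slots visible on the PCI bus that the driver did not report."""
    seen_by_driver = {strip_domain(b) for b in smi_bus_ids}
    return [s for s in pci_nvidia_vga_slots(lspci_text) if s not in seen_by_driver]


print(missing_from_driver(LSPCI_OUTPUT, NVIDIA_SMI_BUS_IDS))
```

With the sample data this reports the card at 02:00.0 as present on the bus but missing from the driver's list, which matches the symptom in the question.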
Besides, “nvidia-smi” sometimes detects both GPUs, but shows “ERR!” for one of them, as follows:
$ nvidia-smi
Sat Aug 5 11:31:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   34C    P0    61W / 360W |   9135MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 0000:02:00.0     Off |                  N/A |
|ERR!   33C    P8   ERR! / 360W |     10MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2033    C   python                                        9133MiB |
|    1      2033    C   python                                           8MiB |
+-----------------------------------------------------------------------------+
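If you want to catch the intermittent “ERR!” state automatically (for example from a cron job), a rough Python sketch follows. It is my own illustration, not something from the posts: it scans the text of an nvidia-smi table like the one above and reports which GPU's status row contains “ERR!”. The regex is based only on the rows shown in this thread, so other driver versions may need adjustments.

```python
import re

# GPU rows copied from the table above; on a live system this text would be
# the captured stdout of `nvidia-smi`.
SMI_TABLE = """\
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   34C    P0    61W / 360W |   9135MiB / 11170MiB |      0%      Default |
|   1  GeForce GTX 108...  Off  | 0000:02:00.0     Off |                  N/A |
|ERR!   33C    P8   ERR! / 360W |     10MiB / 11172MiB |      0%      Default |
"""

# First row of each GPU entry: index, name, then the PCI bus id in column two.
GPU_ROW = re.compile(
    r"^\|\s*(\d+)\s+.*?\|\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s"
)


def gpus_with_errors(table_text):
    """Return (index, bus_id) for each GPU whose status row contains 'ERR!'."""
    flagged = []
    current = None  # GPU whose status row is expected on the next line
    for line in table_text.splitlines():
        header = GPU_ROW.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
        elif current is not None:
            if "ERR!" in line:
                flagged.append(current)
            current = None
    return flagged


print(gpus_with_errors(SMI_TABLE))
```

On the sample rows this flags GPU 1 at 0000:02:00.0, the same card that goes missing in the other runs.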
That dmesg output doesn’t make sense to me, since the NVIDIA driver writes a message to the system log when it loads, like this:
$ dmesg |grep NVRM
[ 6.655273] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017 (using threaded interrupts)
Something there doesn’t add up.
Maybe you haven’t removed nouveau properly.
Instead of the 375.82 driver, you may want to try 384.59:
http://www.nvidia.com/download/driverResults.aspx/120917/en-us
Make sure to follow the instructions in the CUDA 8 Linux install guide concerning removal of nouveau:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract
(probably a good idea to read the whole install guide, if you haven’t)
I reinstalled cuda-8.0 without installing the bundled driver, then installed driver 384.59. “nvidia-smi” still detects only one GPU card.
~$ nvidia-smi
Sun Aug 6 14:05:10 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8    11W / 360W |    126MiB / 11169MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1033    G   /usr/lib/xorg/Xorg                              74MiB |
|    0      1772    G   compiz                                          41MiB |
|    0      2251    G   fcitx-qimpanel                                   8MiB |
+-----------------------------------------------------------------------------+
~$ dmesg|grep NVRM
[ 2.081639] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.59 Wed Jul 19 23:53:34 PDT 2017 (using threaded interrupts)
[ 3.226094] NVRM: Your system is not currently configured to drive a VGA console
[ 4.183854] NVRM: GPU at PCI:0000:02:00: GPU-a29e3dde-b03b-5484-803c-c5bf5b3df99c
[ 4.183857] NVRM: GPU Board Serial Number:
[ 4.183858] NVRM: Xid (PCI:0000:02:00): 62, 1bad(b2f4) 00000000 00000000
[ 62.305789] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 62.305834] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 75.165485] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 75.165517] NVRM: rm_init_adapter failed for device bearing minor number 1
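To pull the useful bits out of a log like this, here is a small Python sketch (my own, not from the thread). It extracts the Xid number with its PCI address and the device minor numbers for which adapter initialization failed. The regular expressions are an assumption based only on the lines pasted above, and the sketch does not try to interpret what a given Xid number means.

```python
import re

# NVRM lines copied from the dmesg output above.
DMESG_NVRM = """\
[ 4.183854] NVRM: GPU at PCI:0000:02:00: GPU-a29e3dde-b03b-5484-803c-c5bf5b3df99c
[ 4.183858] NVRM: Xid (PCI:0000:02:00): 62, 1bad(b2f4) 00000000 00000000
[ 62.305789] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 62.305834] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 75.165485] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 75.165517] NVRM: rm_init_adapter failed for device bearing minor number 1
"""

XID = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")
INIT_FAIL = re.compile(
    r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)"
)


def summarize_nvrm(log_text):
    """Return ([(pci_address, xid_number), ...], [failed_minor, ...])."""
    xids = []
    failed_minors = set()
    for line in log_text.splitlines():
        match = XID.search(line)
        if match:
            xids.append((match.group(1), int(match.group(2))))
            continue
        match = INIT_FAIL.search(line)
        if match:
            failed_minors.add(int(match.group(1)))
    return xids, sorted(failed_minors)


print(summarize_nvrm(DMESG_NVRM))
```

For the log above this gives one Xid 62 event on PCI 0000:02:00 and repeated init failures for minor number 1, so both point at the second card.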
Other information:
~$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information
Model: GeForce GTX 1080 Ti
IRQ: 135
GPU UUID: GPU-637bcd08-b214-bd17-8b99-035cfea0b6a7
Video BIOS: 86.02.39.00.9c
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
~$ cat /proc/driver/nvidia/gpus/0000\:02\:00.0/information
Model: GeForce GTX 1080 Ti
IRQ: 136
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:02:00.0
Device Minor: 1
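The difference between the two files is the placeholder UUID and Video BIOS on the second card. A short Python sketch of how one might flag that automatically is below. It is only an illustration: the function names are hypothetical, and the sample text is copied (abridged) from the two listings above. On a live system the text would be read from the files under /proc/driver/nvidia/gpus/.

```python
# Abridged copies of the two `information` files shown above.
GPU_01 = """\
Model: GeForce GTX 1080 Ti
GPU UUID: GPU-637bcd08-b214-bd17-8b99-035cfea0b6a7
Video BIOS: 86.02.39.00.9c
Bus Location: 0000:01:00.0
Device Minor: 0
"""

GPU_02 = """\
Model: GeForce GTX 1080 Ti
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Location: 0000:02:00.0
Device Minor: 1
"""


def parse_information(text):
    """Split 'Key: value' lines into a dict (splitting on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_placeholder(value):
    """True when the value is only '?' characters plus '-' or '.' separators."""
    body = value[4:] if value.startswith("GPU-") else value
    return "?" in body and all(ch in "?-." for ch in body)


def uninitialized_fields(text):
    """Names of the identity fields the driver could not read from the card."""
    info = parse_information(text)
    return [k for k in ("GPU UUID", "Video BIOS") if is_placeholder(info.get(k, ""))]


print(uninitialized_fields(GPU_01))
print(uninitialized_fields(GPU_02))
```

The first card yields an empty list and the second yields both fields, consistent with the driver failing to initialize the card at 0000:02:00.0.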
The loaded nvidia modules:
~$ lsmod|grep nvidia
nvidia_drm 49152 1
nvidia_modeset 843776 5 nvidia_drm
nvidia 13041664 94 nvidia_modeset
drm_kms_helper 155648 2 i915_bpo,nvidia_drm
drm 364544 5 i915_bpo,drm_kms_helper,nvidia_drm
System:
~$ uname -a
Linux nisp-dmi-02 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Module information:
~$ modinfo nvidia
filename: /lib/modules/4.4.0-89-generic/kernel/drivers/video/nvidia.ko
alias: char-major-195-*
version: 384.59
supported: external
license: NVIDIA
srcversion: 31FF0349D3C7B1D9A62B474
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends:
vermagic: 4.4.0-89-generic SMP mod_unload modversions
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_UseThreadedInterrupts:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp
The nouveau driver is blacklisted and unloaded:
~$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
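One caveat: the `lsmod|grep nvidia` output earlier cannot show nouveau either way, because the grep filters it out. A quick Python sketch of a check on unfiltered `lsmod` output plus the blacklist file is below. It is my own illustration with made-up helper names. The sample module list is the filtered one from this thread, so on a real system you would pass in the complete `lsmod` output instead.

```python
# Module list copied from the `lsmod|grep nvidia` output above (filtered, so
# it is only sample data), and the blacklist file contents from this post.
LSMOD = """\
nvidia_drm 49152 1
nvidia_modeset 843776 5 nvidia_drm
nvidia 13041664 94 nvidia_modeset
drm_kms_helper 155648 2 i915_bpo,nvidia_drm
drm 364544 5 i915_bpo,drm_kms_helper,nvidia_drm
"""

BLACKLIST_CONF = """\
blacklist nouveau
options nouveau modeset=0
"""


def loaded_modules(lsmod_text):
    """First column of every non-empty lsmod line."""
    return {line.split()[0] for line in lsmod_text.splitlines() if line.strip()}


def is_blacklisted(conf_text, module):
    """True if the config has a 'blacklist <module>' line (comments ignored)."""
    for line in conf_text.splitlines():
        words = line.split("#", 1)[0].split()
        if words[:2] == ["blacklist", module]:
            return True
    return False


def nouveau_status(lsmod_text, conf_text):
    return {
        "nouveau_loaded": "nouveau" in loaded_modules(lsmod_text),
        "nouveau_blacklisted": is_blacklisted(conf_text, "nouveau"),
    }


print(nouveau_status(LSMOD, BLACKLIST_CONF))
```

With the data from this thread it reports nouveau as blacklisted and not loaded, which agrees with the poster's statement.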
Maybe insufficient power is being delivered to the 2nd GPU.
I swapped the power cables of GPU1 and GPU2, and the same issue occurred.
The GPUs are water-cooled, so it is hard for me to swap their positions to test.
So is this a hardware problem rather than a software problem?
herber
February 5, 2018, 2:39am
10
I have encountered these errors over the years and always struggled to figure out what is really wrong.
Have I configured the kernel properly, and have I used the right nvidia driver? It is very frustrating that there is no tool to help you check the software configuration.
I have three 450’s and a 750 in three different computers.
On the machine with a 450 and a 750, the 450 is working great, but the 750 gets a video BIOS copy failure and this error:
[ 62.305789] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
And this when running nvidia-smi:
GPU UUID: GPU-???-???-???-???-???
The 450’s work in any of my three Gentoo computers, but the 750 never works.
I have decided that there is a hardware problem with the 750.
How can you test for hardware problems?