nvidia-smi "No devices were found" error

Hi.

I am trying to get my K40c running on a Red Hat Enterprise Linux system.
However, I’m having trouble getting nvidia-smi to recognize the GPU; I get the “No devices were found” error when I type “nvidia-smi -a”.

I installed the CUDA 7.0 toolkit, then upgraded the driver to 346.59, and then rebooted the system.

Here is some info:

ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Jun 22 14:22 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 22 14:22 /dev/nvidiactl

lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)

modprobe -l | grep nvidia
kernel/drivers/video/backlight/mbp_nvidia_bl.ko
kernel/drivers/video/nvidia/nvidiafb.ko
extra/nvidia-uvm.ko
extra/nvidia.ko

lsmod | grep nvidia
nvidia 8368780 0
i2c_core 31084 4 igb,i2c_algo_bit,i2c_i801,nvidia

modinfo nvidia
filename: /lib/modules/2.6.32-431.5.1.el6.x86_64/extra/nvidia.ko
alias: char-major-195-*
version: 346.59
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.32-431.5.1.el6.x86_64 SMP mod_unload modversions
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp

How did you install the toolkit?
How did you upgrade the driver?

Is the behavior any different if you run nvidia-smi as root?

What is the result of running:

dmesg | grep NVRM

?

I installed the toolkit and the driver by downloading and running cuda_7.0.28_linux.run and NVIDIA-Linux-x86_64-346.59.run (in that order).

sudo nvidia-smi -a gives me the same error message, “No devices were found.”

dmesg | grep NVRM
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 346.59 Tue Mar 31 14:10:31 PDT 2015
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5

It may be an interaction with nouveau. Have you explicitly removed the nouveau driver?

The method is described in the Linux Getting Started Guide:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#runfile-nouveau

Neither ‘lsmod | grep nouveau’ nor ‘sudo lsmod | grep nouveau’ returns anything.

However, I went and created the blacklist file for RHEL and rebooted the system, and nvidia-smi still can’t locate the GPU.

The blacklist file alone is not necessarily enough. If nouveau is in the initrd, it must be removed from that as well.
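For reference, the RHEL blacklist file from the guide linked above looks like this (a sketch of the path and contents described there, not verified on this exact system):

```shell
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

After creating it, the initramfs has to be regenerated (e.g. sudo dracut --force) and the system rebooted; lsinitrd | grep nouveau can be used to check whether nouveau is still baked into the initrd.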

It may also be a BIOS problem with the system the GPU is plugged into. To discover that, you would need to run

lspci -vvv

and focus on the allocations for the K40 GPU. Following the sequence here:

https://devtalk.nvidia.com/default/topic/816404/cuda-programming-and-performance/plugging-tesla-k80-results-in-pci-resource-allocation-error-/

I’m fairly certain nouveau’s not an issue, as I uninstalled it using yum (remove xorg-x11-drv-nouveau.x86_64). Also, if it were loaded, it would probably show up in lsmod.
(Also, I did dracut --force.)
I’ll try what’s in the link, but I don’t see how the BIOS could be an issue, as it was working fine just a few weeks ago (I haven’t touched it since the last time I used it).
The only thing I can think of is that Red Hat installed something automatically (maybe a system update?) that’s causing this issue.

I’ll update this post after I’ve tried your suggestion.

Thank you,

This is the output from lspci -vvv

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

The memory regions seem to be fine (assigned). However, one of the kernel modules is “nouveau.” Could this be causing the problem?

Here is the lspci output with sudo

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 00000000fee00a18 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

You stated previously that you had removed nouveau in its entirety, yet the output from lspci above shows a nouveau kernel module. That appears to be a contradiction.

I was under the impression that I had removed the nouveau driver. This is the output from yum:

sudo yum remove xorg-x11-drv-nouveau.x86_64
Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Remove Process
No Match for argument: xorg-x11-drv-nouveau.x86_64
Package(s) xorg-x11-drv-nouveau.x86_64 available, but not installed.
No Packages marked for removal

So does the line “Kernel modules: nvidia, nouveau, nvidiafb” mean that the nouveau driver is currently loaded?
txbob’s suggestion was to remove the nouveau driver. Your comment seems to imply that the kernel module is identical to the driver. Is it? Should I remove the kernel module as well?

I’m a little confused now…

The information that this was all working a few weeks ago is new information that I was not aware of at the beginning of the thread.

Redhat can install kernel updates that will break the driver installed via the runfile installation method.

This is usually rectified by re-running the driver installer. Now I’m not sure if you recently installed the driver, or simply did that a few weeks ago and are now discovering that it’s not working.
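A quick way to check for that situation is to compare the running kernel against the kernel the nvidia module was built for. This is only a sketch: kernels_match is a helper name of my own, not an NVIDIA tool, and modinfo -F vermagic simply prints the kernel version string the module was compiled against.

```shell
#!/bin/sh
# Sketch: detect a runfile-built driver orphaned by a kernel update.
# kernels_match is a hypothetical helper, not part of the NVIDIA tooling.
kernels_match() {
  # $1 = running kernel, $2 = kernel the nvidia module was built for
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "MISMATCH - re-run the driver installer"
  fi
}
# modinfo prints e.g. "2.6.32-431.5.1.el6.x86_64 SMP mod_unload modversions";
# awk keeps just the kernel version field.
kernels_match "$(uname -r)" "$(modinfo -F vermagic nvidia 2>/dev/null | awk '{print $1}')"
```

If the two strings differ (or the module is missing entirely), re-running the runfile installer should rebuild the module against the current kernel.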

Sorry, I guess I should have mentioned that earlier.

Today, I found out that nvidia-smi wasn’t able to locate the GPU and none of my old code was working, so I reinstalled the CUDA toolkit and the latest driver, following the method recommended by NVIDIA (http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract).

Also, is this message something that might be causing this problem? (from dmesg)

NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.

Thank you.

I’m pretty much out of ideas. If, by chance, the previous driver install method was not by runfile but instead by the repo method, then that could be an issue.

The results of:

sudo yum list nvidia-*

would help to rule that possibility in or out.

Somebody having installed CUDA through yum is a possibility, as I’m not the only one with sudo access to the system. (I always use the runfile method.)

The output for sudo yum list nvidia-* is

Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Available Packages
nvidia-kmod.x86_64 1:346.46-2.el6 cuda
nvidia-modprobe.x86_64 319.37-1.el6 cuda
nvidia-settings.x86_64 319.37-30.el6 cuda
nvidia-uvm-kmod.x86_64 1:346.46-3.el6 cuda
nvidia-xconfig.x86_64 319.37-27.el6 cuda

Yes, it’s a problem. It did not escape my attention earlier, but for me it merely confirms what we already know: the driver is not running correctly.

The yum list nvidia-* output doesn’t indicate any nvidia modules installed, so it does not appear to me that there is any issue with a previous yum/repo installation.

I would ordinarily assume that if you did a driver install via runfile, the install completed successfully. There would usually be a message to that effect. If it did not complete successfully, there may be useful information in the driver installer log file, which is usually deposited in:

/var/log/nvidia-installer.log

That file is difficult to parse, but if, for example, there were messages in there about “unable to locate kernel headers”, that would be indicative of a problem that could have been triggered by a Red Hat update (although the driver installer should also have given a clear error message when you ran it).
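As a sketch, something like the following can pull the usual failure signatures out of that log. The scan_log helper and the pattern list are mine; the patterns are illustrative examples, not an exhaustive list of what the installer can report.

```shell
#!/bin/sh
# Sketch: grep the driver installer log for common failure signatures.
# scan_log is a hypothetical helper; the patterns are illustrative only.
scan_log() {
  grep -iE "error|kernel headers|kernel source" "$1" 2>/dev/null \
    || echo "no obvious errors in $1"
}
scan_log /var/log/nvidia-installer.log
```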

For completeness, I don’t believe the reference to nouveau like this:

Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

is an issue.

Hi, I realize this thread is three years old now, but I have the exact same problem. For what it’s worth, my system was running just fine when it suddenly crashed, and since then it has been giving me the same problems (RmInitAdapter failure, and the GPU not detected by nvidia-smi).

Did you finally manage to fix this issue?

BTW, I have tried all the suggestions here – purging and reinstalling CUDA etc. No luck…

I have the same problem too; has anyone solved this?
Ubuntu 16.04, NVIDIA driver 384.130.
It happened on the next boot after a long DNN compute run.

Did anyone find a solution to this?
As far as I can tell, it’s not a kernel issue or a driver issue.
I dual-boot Linux and Windows. I hit a BSOD with a GPU-related error (I don’t remember the exact message), and since then Windows doesn’t recognize my GPU, though I can still see it in Device Manager. I tried everything from installing new drivers to the factory drivers, and then did a fresh install of Windows. Still no luck. Then I switched to Linux and I still can’t use my NVIDIA GPU. I tried the LTS kernel, the Zen kernel, and of course the latest kernel too. Still no luck; I get the same results as lemonherb did from the commands he mentioned.