nvidia-smi "No devices were found" error

Hi.

I am trying to get my K40c running on a Red Hat Enterprise Linux system.
However, I’m having trouble getting nvidia-smi to recognize the GPU; I get the “No devices were found” error when I type “nvidia-smi -a”.

I installed the CUDA 7.0 toolkit, then upgraded the driver to 346.59, and then rebooted the system.

Here is some info:

ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Jun 22 14:22 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 22 14:22 /dev/nvidiactl

lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)

modprobe -l | grep nvidia
kernel/drivers/video/backlight/mbp_nvidia_bl.ko
kernel/drivers/video/nvidia/nvidiafb.ko
extra/nvidia-uvm.ko
extra/nvidia.ko

lsmod | grep nvidia
nvidia 8368780 0
i2c_core 31084 4 igb,i2c_algo_bit,i2c_i801,nvidia

modinfo nvidia
filename: /lib/modules/2.6.32-431.5.1.el6.x86_64/extra/nvidia.ko
alias: char-major-195-*
version: 346.59
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.32-431.5.1.el6.x86_64 SMP mod_unload modversions
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp

How did you install the toolkit?
How did you upgrade the driver?

Is the behavior any different if you run nvidia-smi as root?

What is the result of running:

dmesg | grep NVRM

?

I installed the toolkit and the driver by downloading and running cuda_7.0.28_linux.run and NVIDIA-Linux-x86_64-346.59.run (in that order).

sudo nvidia-smi -a gives me the same error message, “No devices were found.”

dmesg | grep NVRM
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 346.59 Tue Mar 31 14:10:31 PDT 2015
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5

It may be an interaction with nouveau. Have you explicitly removed the nouveau driver?

The method is described in the Linux Getting Started Guide:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#runfile-nouveau

Neither ‘lsmod | grep nouveau’ nor ‘sudo lsmod | grep nouveau’ returns anything.

However, I went and created the blacklist file for RHEL and rebooted the system, and nvidia-smi still can’t locate the GPU.

The blacklist file alone is not necessarily enough. If nouveau is in the initrd, it must be removed from that as well.
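For reference, the RHEL blacklist file from the guide linked above looks like this (a sketch of the path and contents described there, not verified on this exact system):

```shell
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

After creating it, the initramfs has to be regenerated (e.g. sudo dracut --force) and the system rebooted; lsinitrd | grep nouveau can be used to check whether nouveau is still baked into the initrd.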

It may also be a BIOS problem with the system the GPU is plugged into. To discover that, you would need to run

lspci -vvv

and focus on the allocations for the K40 GPU. Following the sequence here:

https://devtalk.nvidia.com/default/topic/816404/cuda-programming-and-performance/plugging-tesla-k80-results-in-pci-resource-allocation-error-/

I’m fairly certain nouveau’s not an issue, as I uninstalled it using yum (remove xorg-x11-drv-nouveau.x86_64). Also, if it were loaded, it would probably show up in lsmod.
(Also, I did dracut --force.)
I’ll try what’s in the link, but I don’t see how the BIOS could be an issue, as it was working fine just a few weeks ago (I haven’t touched it since the last time I used it).
The only thing I can think of is that Red Hat installed something automatically (maybe a system update?) that’s causing this issue.

I’ll update this post after I’ve tried your suggestion.

Thank you,

This is the output from lspci -vvv

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

The memory regions seem to be fine (assigned). However, one of the kernel modules is “nouveau.” Could this be causing the problem?

Here is the lspci output with sudo

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 00000000fee00a18 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

You stated previously that you had removed nouveau in its entirety, yet the output from lspci above shows a nouveau kernel module. That appears to be a contradiction.

I was under the impression that I had removed the nouveau driver. This is the output from yum:

sudo yum remove xorg-x11-drv-nouveau.x86_64
Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Remove Process
No Match for argument: xorg-x11-drv-nouveau.x86_64
Package(s) xorg-x11-drv-nouveau.x86_64 available, but not installed.
No Packages marked for removal

So does the line “Kernel modules: nvidia, nouveau, nvidiafb” mean that the nouveau driver is currently loaded?
txbob’s suggestion was to remove the nouveau driver. Your comment seems to imply that the kernel module is identical to the driver. Is it? Should I remove the kernel module as well?

I’m a little confused now…

The information that this was all working a few weeks ago is new information that I was not aware of at the beginning of the thread.

Redhat can install kernel updates that will break the driver installed via the runfile installation method.

This is usually rectified by re-running the driver installer. Now I’m not sure if you recently installed the driver, or simply did that a few weeks ago and are now discovering that it’s not working.
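A quick way to check for that situation is to compare the running kernel against the kernel the nvidia module was built for. This is only a sketch: kernels_match is a helper name of my own, not an NVIDIA tool, and modinfo -F vermagic simply prints the kernel version string the module was compiled against.

```shell
#!/bin/sh
# Sketch: detect a runfile-built driver orphaned by a kernel update.
# kernels_match is a hypothetical helper, not part of the NVIDIA tooling.
kernels_match() {
  # $1 = running kernel, $2 = kernel the nvidia module was built for
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "MISMATCH - re-run the driver installer"
  fi
}
# modinfo prints e.g. "2.6.32-431.5.1.el6.x86_64 SMP mod_unload modversions";
# awk keeps just the kernel version field.
kernels_match "$(uname -r)" "$(modinfo -F vermagic nvidia 2>/dev/null | awk '{print $1}')"
```

If the two strings differ (or the module is missing entirely), re-running the runfile installer should rebuild the module against the current kernel.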

Sorry, I guess I should have mentioned that earlier.

Today, I found out that nvidia-smi wasn’t able to locate the GPU and none of my old code was working, so I reinstalled the CUDA toolkit and the latest driver, following the method recommended by NVIDIA (http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract).

Also, is this message something that might be causing this problem? (from dmesg)

NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.

Thank you.

I’m pretty much out of ideas. If, by chance, the previous driver install method was not by runfile but instead by the repo method, then that could be an issue.

The results of:

sudo yum list nvidia-*

would help to rule that possibility in or out.

Somebody having installed CUDA through yum is a possibility, as I’m not the only one with sudo access to the system. (I always use the runfile method.)

The output for sudo yum list nvidia-* is

Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Available Packages
nvidia-kmod.x86_64 1:346.46-2.el6 cuda
nvidia-modprobe.x86_64 319.37-1.el6 cuda
nvidia-settings.x86_64 319.37-30.el6 cuda
nvidia-uvm-kmod.x86_64 1:346.46-3.el6 cuda
nvidia-xconfig.x86_64 319.37-27.el6 cuda

Yes, it’s a problem. It did not escape my attention earlier, but for me it merely confirms what we already know: the driver is not running correctly.

The yum list nvidia-* output doesn’t indicate any nvidia modules installed, so it does not appear to me that there is any issue with a previous yum/repo installation.

I would ordinarily assume that if you did a driver install via runfile, the install completed successfully. There would usually be a message to that effect. If it did not complete successfully, there may be useful information in the driver installer log file, which is usually deposited in:

/var/log/nvidia-installer.log

That file is difficult to parse, but if, for example, there were messages in there about “unable to locate kernel headers”, that would be indicative of a problem that could have been triggered by a Red Hat update (although the driver installer should also have given a clear error message when you ran it).
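As a sketch, something like the following can pull the usual failure signatures out of that log. The scan_log helper and the pattern list are mine; the patterns are illustrative examples, not an exhaustive list of what the installer can report.

```shell
#!/bin/sh
# Sketch: grep the driver installer log for common failure signatures.
# scan_log is a hypothetical helper; the patterns are illustrative only.
scan_log() {
  grep -iE "error|kernel headers|kernel source" "$1" 2>/dev/null \
    || echo "no obvious errors in $1"
}
scan_log /var/log/nvidia-installer.log
```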

For completeness, I don’t believe the reference to nouveau like this:

Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

is an issue.

Hi, I realize this thread is three years old now, but I have the exact same problem. For what it’s worth, my system was running just fine when it suddenly crashed, and since then it has been giving me the same problems (RmInitAdapter failure, and the GPU not detected by nvidia-smi).

Did you finally manage to fix this issue?

BTW, I have tried all the suggestions here – purging and reinstalling CUDA etc. No luck…

I have the same problem too; has anyone solved this?
Ubuntu 16.04, NVIDIA driver 384.130.
It happened on the next boot after a long DNN compute run.

Did anyone find a solution to this?
As far as I can tell, it’s not a kernel issue or a driver issue.
I dual-boot Linux and Windows. I hit a BSOD with a GPU-related error (I don’t remember the exact message), and since then Windows doesn’t recognize my GPU, though I can still see it in Device Manager. I tried everything from installing new drivers to the factory drivers, and then did a fresh install of Windows. Still no luck. Then I switched to Linux and I still can’t use my NVIDIA GPU. I tried the LTS kernel, the Zen kernel, and of course the latest kernel too. Still no luck; I get the same results as lemonherb did from the commands he mentioned.