nvidia-smi "No devices were found" error

Neither 'lsmod | grep nouveau' nor 'sudo lsmod | grep nouveau' returns anything.

However, I went and created the blacklist file for RHEL and rebooted the system, and nvidia-smi still can’t locate the GPU.

The blacklist file alone is not necessarily enough. If nouveau is in the initrd, it must be removed from that as well.
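
For reference, the usual RHEL sequence is roughly the following (a minimal sketch; the blacklist file name is just a convention):

# blacklist nouveau for future module loads
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
# rebuild the initrd so nouveau is not pulled in at early boot
sudo dracut --force
# reboot, then verify: lsmod | grep nouveau should return nothing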

It may also be a BIOS problem with the system that it is plugged into. To discover that you would need to run

lspci -vvv

and focus on the allocations for the K40 GPU. Following the sequence here:

https://devtalk.nvidia.com/default/topic/816404/cuda-programming-and-performance/plugging-tesla-k80-results-in-pci-resource-allocation-error-/
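
To narrow the lspci output to just the GPU, something like this works (a sketch; substitute whatever bus address the first command reports):

# find the NVIDIA device and note its bus address
lspci | grep -i nvidia
# then dump full details for that device only (03:00.0 here is an example)
sudo lspci -vvv -s 03:00.0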

I’m fairly certain nouveau isn’t the issue, as I uninstalled it with yum (yum remove xorg-x11-drv-nouveau.x86_64). Also, if it were loaded, it would show up in lsmod.
(I also ran dracut --force.)
I’ll try what’s in the link, but I don’t see how the BIOS could be the issue: the card was working fine just a few weeks ago, and I haven’t touched the machine since I last used it.
The only thing I can think of is that Red Hat installed something automatically (maybe a system update?) that is causing this.

I’ll update this post after I’ve tried your suggestion.

Thank you,

This is the output from lspci -vvv

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

The memory regions seem to be fine (assigned). One of the kernel modules listed is nouveau. Could this be causing the problem?

Here is the lspci output with sudo

03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K40c] (rev a1)
Subsystem: NVIDIA Corporation Device 0983
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at ce000000 (32-bit, non-prefetchable)
Region 1: Memory at b0000000 (64-bit, prefetchable)
Region 3: Memory at c0000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 00000000fee00a18 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

You stated previously that you had removed nouveau in its entirety, yet the lspci output above shows a nouveau kernel module. Isn’t that a contradiction?

I was under the impression that I had removed the nouveau driver. This is the output from yum:

sudo yum remove xorg-x11-drv-nouveau.x86_64
Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Remove Process
No Match for argument: xorg-x11-drv-nouveau.x86_64
Package(s) xorg-x11-drv-nouveau.x86_64 available, but not installed.
No Packages marked for removal

So does the line “Kernel modules: nvidia, nouveau, nvidiafb” mean that the nouveau driver is currently loaded?
txbob’s suggestion was to remove the nouveau driver, and your comment seems to imply that the kernel module is the same thing as the driver. Is it? Should I remove the kernel module as well?

I’m a little confused now…

That this was all working a few weeks ago is new information; I was not aware of it at the beginning of the thread.

Red Hat can install kernel updates that will break a driver installed via the runfile method.

This is usually rectified by re-running the driver installer. I’m not sure whether you installed the driver just recently, or did so a few weeks ago and are only now discovering that it doesn’t work.
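
A quick way to recover from such a mismatch is to rebuild against the current kernel (a sketch; the runfile name below is a placeholder for the installer you originally downloaded):

# kernel currently running
uname -r
# re-run the runfile installer so the module is rebuilt for that kernel
# (placeholder file name)
sudo sh NVIDIA-Linux-x86_64-XXX.XX.run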

Sorry, I guess I should have mentioned that earlier.

Today I found out that nvidia-smi wasn’t able to locate the GPU and none of my old code was working, so I reinstalled the CUDA toolkit and the latest driver, following the method recommended by NVIDIA (http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract).

Also, might these messages (from dmesg) be the cause of the problem?

NVRM: RmInitAdapter failed! (0x30:0xffff:800)
NVRM: rm_init_adapter failed for device bearing minor number 0
NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
NVRM: failed to copy vbios to system memory.
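
(If more context would help, I believe the driver ships a collection script that bundles dmesg, lspci, and log output into a single archive; a sketch:)

# writes nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh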

Thank you.

I’m pretty much out of ideas. If, by chance, the previous driver install was done via the repo method rather than the runfile, that could be an issue.

The results of:

sudo yum list nvidia-*

would help rule that possibility in or out.
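
(Installed packages show up under an “Installed Packages” heading in that listing; to show only what is actually installed, a narrower check is:)

sudo yum list installed 'nvidia-*'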

Somebody having installed CUDA through yum is a possibility, as I’m not the only one with sudo access to the system. (I always use the runfile method.)

The output of sudo yum list nvidia-* is:

Loaded plugins: refresh-packagekit, rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Available Packages
nvidia-kmod.x86_64 1:346.46-2.el6 cuda
nvidia-modprobe.x86_64 319.37-1.el6 cuda
nvidia-settings.x86_64 319.37-30.el6 cuda
nvidia-uvm-kmod.x86_64 1:346.46-3.el6 cuda
nvidia-xconfig.x86_64 319.37-27.el6 cuda

Yes, it’s a problem. It did not escape my attention earlier, but for me it merely confirms what we already know: the driver is not running correctly.

The yum list nvidia-* output doesn’t show any nvidia packages installed, so it does not appear to me that there is any issue from a previous yum/repo installation.

I would ordinarily assume that if you did a driver install via the runfile, the install completed successfully; there is usually a message to that effect. If it did not complete successfully, there may be useful information in the driver installer log file, which is usually deposited in:

/var/log/nvidia-installer.log

That file is difficult to parse, but if, for example, there were messages in there about “unable to locate kernel headers”, that would indicate a problem that could have been triggered by a Red Hat update (although the driver installer should also have given a clear error message when you ran it).
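
A rough filter to surface the interesting lines (not exhaustive):

grep -iE "error|warning|kernel headers" /var/log/nvidia-installer.log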

For completeness, I don’t believe a reference to nouveau like this:

Kernel driver in use: nvidia
Kernel modules: nvidia, nouveau, nvidiafb

is an issue.
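
“Kernel modules” only lists drivers that could bind to the device; “Kernel driver in use” shows the one actually bound. To double-check that nouveau is not resident in the running kernel:

lsmod | grep -E 'nouveau|nvidia'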

Hi, I realize this thread is three years old now, but I have the exact same problem. For what it is worth, my system was running just fine when it suddenly crashed; since then it has been giving me the same symptoms (RmInitAdapter failure, and the GPU not detected by nvidia-smi).

Did you finally manage to fix this issue?

BTW, I have tried all the suggestions here (purging and reinstalling CUDA, etc.). No luck…

I have the same problem too. Has anyone solved this?
Ubuntu 16.04, NVIDIA 384.130.
It happened on the next boot after a long DNN compute run.

Did anyone find the solution to this? In my view it’s not a kernel issue or a driver issue.
I dual-boot Linux and Windows. I hit a BSOD with a GPU-related error (I don’t remember the exact message), and since then Windows hasn’t recognized my GPU, although I can still see it in Device Manager. I tried everything from installing new drivers to the factory drivers, and then did a fresh install of Windows. Still no luck. I then jumped to Linux, and I still can’t switch to my NVIDIA GPU. I tried the LTS kernel, the Zen kernel, and of course the latest kernel. Still no luck; I get the same results as lemonherb did from the commands he mentioned.

Having the same problem with a GTX 1650. I tried to run it on Debian 10 and on Ubuntu 20.04.
Debian 10: standard and experimental kernels, 440 and 450 drivers. No luck.
Now I am trying to get it working on Ubuntu with the 450 driver included in the Ubuntu repos. Still no luck. Everything was done on fresh installs. Some command output below:

lspci -vvv
22:00.0 VGA compatible controller: NVIDIA Corporation TU117 [GeForce GTX 1650] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] TU117 [GeForce GTX 1650]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 175
        NUMA node: 1
        Region 0: Memory at d9000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 3c000000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 3c010000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 6000 [size=128]
        Expansion ROM at daf00000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Via message, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Resizable BAR <?>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

dmesg | grep NVRM

[    5.865440] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.66 Wed Aug 12 19:42:48 UTC 2020
[  101.738202] NVRM: GPU 0000:22:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[  101.738300] NVRM: GPU 0000:22:00.0: rm_init_adapter failed, device minor number 0
[  885.973227] NVRM: GPU 0000:22:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[  885.973319] NVRM: GPU 0000:22:00.0: rm_init_adapter failed, device minor number 0
[  942.341436] NVRM: GPU 0000:22:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[  942.341481] NVRM: GPU 0000:22:00.0: rm_init_adapter failed, device minor number 0
[ 1403.988729] NVRM: GPU 0000:22:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[ 1403.988775] NVRM: GPU 0000:22:00.0: rm_init_adapter failed, device minor number 0
[ 1419.358018] NVRM: GPU 0000:22:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[ 1419.358105] NVRM: GPU 0000:22:00.0: rm_init_adapter failed, device minor number 0

lsmod | grep nvidia

nvidia_uvm 1007616 0
nvidia_drm 53248 0
nvidia_modeset 1183744 1 nvidia_drm
nvidia 19701760 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 mgag200,nvidia_drm
drm 491520 6 drm_kms_helper,drm_vram_helper,mgag200,nvidia_drm,ttm

debian_dell_440.log.gz (554.8 KB)
ubuntu_dell_450.log.gz (108.6 KB)

Well, I kind of fixed it by tapping on the back of the laptop.

I suddenly started having the issue after a reboot about a week ago as well, on Ubuntu 20.04.

Things I’ve tried so far:

  • Upgrading Ubuntu 20.04
  • Upgrading/Downgrading NVIDIA driver via run file (450.66 / 455.38 / 440.100)
  • Nouveau driver is blacklisted; I also regenerated the initramfs so the blacklist takes effect at boot (sketch below).
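
Roughly what I ran for the initramfs rebuild (Ubuntu’s counterpart to dracut):

# regenerate the initramfs for the running kernel so the
# nouveau blacklist is honored at early boot
sudo update-initramfs -u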

I then purged everything, e.g.:

sudo apt-get remove nvidia-* xserver-xorg-* && sudo apt-get purge nvidia-* xserver-xorg-* && sudo apt-get autoclean && sudo apt-get autoremove

Shut down, and tested the card in a different system, where it ran perfectly: I loaded it to 100% for 2 hours and it completed tasks fine. I then put the card back into my Ubuntu server and reinstalled, this time via apt:

sudo apt-get install dkms build-essential linux-headers-$(uname -r)

# added to /etc/modprobe.d/blacklist.conf (via sudo nano):
blacklist nouveau
blacklist nvidiafb
alias nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo apt-get install nvidia-headless-450-server nvidia-utils-450-server nvidia-container-runtime nvidia-container-toolkit nvidia-docker2

Still have the same issue

$ lspci | grep NVIDIA
07:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
07:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

$ dmesg | grep NVRM
[ 1.569623] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 65.390214] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x26:0xffff:1266)
[ 65.390403] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0

$ dmesg | grep NVIDIA
[ 1.450276] nvidia: module license 'NVIDIA' taints kernel.
[ 1.569623] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 1.572949] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020

$ cat /var/log/kern.log | grep taint
Nov 1 14:13:49 mediabox kernel: [ 1.450269] nvidia: loading out-of-tree module taints kernel.
Nov 1 14:13:49 mediabox kernel: [ 1.450276] nvidia: module license 'NVIDIA' taints kernel.
Nov 1 14:13:49 mediabox kernel: [ 1.450278] Disabling lock debugging due to kernel taint
Nov 1 14:13:49 mediabox kernel: [ 1.460693] nvidia: module verification failed: signature and/or required key missing - tainting kernel
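
(Side note: the “module verification failed” line is the usual taint for an out-of-tree module. If Secure Boot were enforcing, the unsigned module would typically be rejected outright rather than merely tainting, so it’s worth a quick check, assuming mokutil is installed:)

# reports "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state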

$ sudo nvidia-smi
No devices were found

Dear All

I have encountered the same problem: Ubuntu 20.04 / RTX 2060 / CUDA 11.2 / NVIDIA driver 460.32.03.

I have also tried CUDA 10.1, but with the same result.

$ sudo nvidia-smi
No devices were found

$ sudo lspci -vvv indicates nvidia driver in use

The UEFI graphics setting is set to discrete.

I want to install cuDNN, but I prefer to wait until CUDA works properly

BTW:
$ nvcc --version

works properly
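
(As I understand it, nvcc only reports the toolkit version and never talks to the driver, so it can work while nvidia-smi fails; the driver side can be checked independently:)

# toolkit version (works even without a functioning driver)
nvcc --version
# driver version as reported by the loaded kernel module
cat /proc/driver/nvidia/version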

Best regards!

EDIT ONE DAY LATER:

Dear All
I have decided to reinstall my Ubuntu 20.04.
Right after the install I checked nvidia-smi, and it was OK!
The very first thing I installed on the ‘naked’ Ubuntu was CUDA 11.2 and a compatible cuDNN.
nvidia-smi still behaved well!
Just after that, the other packages relying on CUDA & cuDNN were installed.
Everything works fine and I’m able to use accelerated computing :)

Conclusion: installation order is crucial for success. Updating an old CUDA install or repairing a broken one can leave corrupted dependencies behind, even after purging everything (as I had done before).