Driver installed (allegedly), but no devices found through nvidia-smi

I’ve read similar threads on this topic, but so far none have helped me diagnose this issue.
Environment - bare-metal Ubuntu 22.04.3 LTS (fresh install)
Card - GeForce GT 1030

I’m adding this card for basic object detection in an NVR platform. I don’t even need it to drive a monitor; I just need the GPU for hardware acceleration in this particular NVR platform.

I had this same issue on a loaded/configured machine yesterday and couldn’t resolve it, so I started completely from scratch with a fresh Ubuntu install, and I’m still seeing the same results. This time I did not let Ubuntu install the driver automatically. Instead, I meticulously followed this step-by-step guide for the CUDA and driver installation.

The driver appears to be installed when I look at lshw, but I cannot get nvidia-smi to see any devices.

Could the onboard graphics controller be causing a conflict here?

I’ve reviewed the bug report and see that the driver install actually fails, but I’m stumped on what to investigate next.

nvidia-bug-report.log (1.6 MB)

ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.

Apparently something is wrong with your driver install.
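A quick way to pull out the sections that installer message points at (the section names come from the message itself, and the log path is the stock one, so adjust if yours differs):

```shell
# Extract the two sections the installer message references from the install
# log. Each falls back to a note if the log isn't present on this machine.
LOG=/var/log/nvidia-installer.log
grep -n -A 20 'Kernel module load error' "$LOG" 2>/dev/null \
  || echo "no 'Kernel module load error' section found in $LOG"
grep -n -A 40 'Kernel messages' "$LOG" 2>/dev/null \
  || echo "no 'Kernel messages' section found in $LOG"
```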

I see that error line, but I have purged and reinstalled countless times, both via the .run installer downloaded from NVIDIA and via apt. I keep getting the same behavior either way, unfortunately.
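In case it helps anyone following along, here is roughly the purge sequence I’ve been using between attempts, sketched as a dry run so pasting it is harmless (it only prints the plan unless APPLY=1 is set; the package globs and the nvidia-uninstall path are the usual ones, so double-check them on your system):

```shell
# Rough purge sequence before each reinstall, written as a dry run: it only
# prints the plan unless APPLY=1 is set. Verify globs/paths locally.
plan() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}
plan sudo apt-get purge --autoremove '^nvidia-.*' '^libnvidia-.*'
plan sudo /usr/bin/nvidia-uninstall   # exists only after a .run install
plan sudo update-initramfs -u         # drop stale modules from the initrd
plan sudo reboot
```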

I’ve made a little progress here, but I’m still unable to get nvidia-smi to list any devices.

I started fresh, cleaned out all the NVIDIA drivers, libraries, etc., and began again from the top, this time with a slightly more powerful card (a GTX 1050 Ti) and using the runfile versions via the CUDA installer instead of the apt packages. That seemed to help a bit, but it’s still not solved…

  • This time, the driver installations succeeded, whereas previous attempts would fail right at the end.
  • In the CUDA installer log, I found this at the end:

DKMS is installed,proceed with dkms install

[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.17.5 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia>
[ERROR]: Install of nvidia-fs failed, quitting

Other important changes…

  • Secure Boot is OFF
  • I added “pci=realloc=off” to the kernel command line (via GRUB_CMDLINE_LINUX_DEFAULT)
  • Nouveau is blacklisted
  • Nvidia is NOT blacklisted
  • I see the drivers installed and active when I run lshw -c display
  • I see the NVIDIA device nodes present under /dev/dri and /dev
  • Yet, nvidia-smi still says “No devices found”
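For anyone checking the same points, these are the read-only commands I’ve been using to verify each bullet (each one falls back to a message instead of erroring, so the sequence is safe to paste on any box):

```shell
# Read-only sanity checks for the bullet points above; each prints a fallback
# message instead of failing, so this is safe to run anywhere.
grep -rhs nouveau /etc/modprobe.d/ || echo "no nouveau blacklist entries found"
lsmod 2>/dev/null | grep -i nvidia || echo "no nvidia modules currently loaded"
cat /proc/driver/nvidia/version 2>/dev/null \
  || echo "/proc/driver/nvidia/version absent"
ls /dev/nvidia* /dev/dri 2>/dev/null || echo "no NVIDIA device nodes visible"
```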

I just rebooted and ran dmesg to look for any mention of the load failing, and I noticed this as well…

[ 20.758860] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input5
[ 20.758937] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input6
[ 20.758994] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input7
[ 20.759069] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input8
[ 20.759149] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input9
[ 20.759225] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input10
[ 21.414843] nvidia: module license 'NVIDIA' taints kernel.
[ 21.414849] Disabling lock debugging due to kernel taint
[ 21.488564] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

[ 21.490140] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 21.492947] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 21.492956] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 21.492968] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 21.608880] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.104.05 Sat Aug 19 01:15:15 UTC 2023
[ 21.966325] apex 0000:06:00.0: Apex performance not throttled due to temperature
[ 22.216208] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.104.05 Sat Aug 19 00:59:57 UTC 2023
[ 22.252458] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.4)
[ 22.541322] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 22.541326] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1

This sent me down a rabbit hole on the bare-metal hardware side of things, and I found several posts suggesting turning on “Above 4G Decoding” in the BIOS. That’s not a setting this machine has, and I’m already running the latest/final BIOS for it.

Is the dmesg error above telling me that I’m SOL with this motherboard because it can’t map the memory regions (BARs) the card is asking for in this PCI slot? Or am I overthinking this, and it’s really a software issue? I’m comfortable giving up if it’s the motherboard, but if it’s a software issue, I’d still like to solve it.
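In case it’s useful to anyone else staring at the same NVRM lines, this is how I’ve been inspecting the BAR assignment (the 0000:04:00.0 address comes from my dmesg above; lspci output will obviously differ per machine):

```shell
# Read-only look at the PCI regions assigned to the GPU (address from dmesg).
# A missing or [disabled] region here lines up with the NVRM BAR3/BAR4 errors.
GPU=0000:04:00.0
lspci -vs "$GPU" 2>/dev/null | grep -E 'Memory at|I/O ports' \
  || echo "lspci unavailable here or no device at $GPU"
# Confirm which pci= options the kernel actually booted with:
cat /proc/cmdline 2>/dev/null || echo "/proc/cmdline not readable"
```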

Thanks for any thoughts from the experts here.