I’ve made a little progress here, but am still unable to get nvidia-smi to list any devices.
I started fresh: cleaned all the NVIDIA drivers, libraries, etc. out and went from the beginning, this time with a slightly more powerful card (a GTX 1050 Ti), and using the runfile versions via the CUDA installer instead of the apt package versions. It seemed to help a bit, but it’s still not solved…
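For anyone following along, the cleanup and reinstall went roughly like this (the package globs and runfile name are from memory — the runfile name is inferred from the 535.104.05 driver version in the dmesg output below — so double-check against your own system):

```
# Purge every apt-installed NVIDIA/CUDA package before switching to the runfile
# (globs are illustrative; check `dpkg -l | grep -i nvidia` first):
sudo apt-get purge 'nvidia-*' 'libnvidia-*' 'cuda*'
sudo apt-get autoremove --purge

# Then install the driver + toolkit from the runfile:
sudo sh cuda_12.2.2_535.104.05_linux.run
```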
- This time, the driver installation succeeded, whereas previous attempts would fail right at the end.
- In the CUDA installer log, I found this at the end…
DKMS is installed,proceed with dkms install
[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.17.5 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia>
[ERROR]: Install of nvidia-fs failed, quitting
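From what I’ve read, nvidia-fs is only the GPUDirect Storage module (which is why it wants MOFED), so that failure shouldn’t by itself stop nvidia-smi from working. Here’s roughly how I confirmed the core driver modules still built and loaded despite it:

```
# List DKMS modules built for the running kernel; the nvidia driver should
# show as installed even though nvidia-fs failed:
dkms status

# Confirm the kernel modules are actually loaded:
lsmod | grep nvidia

# And that the module is visible to modprobe at all:
modinfo nvidia | head -n 5
```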
Other important changes…
- Secure Boot is OFF
- I added “pci=realloc=off” to the default kernel boot parameters in /etc/default/grub (see the verification commands after this list)
- Nouveau is blacklisted
- Nvidia is NOT blacklisted
- I see the drivers installed and active when I run lshw -c display
- I see the NVIDIA device nodes present in /dev/dri/ and /dev/
- Yet, nvidia-smi still says “No devices found”
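For completeness, these are roughly the commands I used to verify the points above:

```
# Secure Boot state:
mokutil --sb-state

# Kernel boot parameters actually in effect:
cat /proc/cmdline

# Nouveau blacklisted and not loaded:
grep -r nouveau /etc/modprobe.d/
lsmod | grep nouveau      # should print nothing

# NVIDIA driver bound to the card, and device nodes present:
sudo lshw -c display
ls -l /dev/nvidia* /dev/dri/
```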
I just rebooted and ran dmesg to look for any mention of the driver load failing, and I noticed this as well…
[ 20.758860] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input5
[ 20.758937] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input6
[ 20.758994] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input7
[ 20.759069] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input8
[ 20.759149] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input9
[ 20.759225] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card0/input10
[ 21.414843] nvidia: module license 'NVIDIA' taints kernel.
[ 21.414849] Disabling lock debugging due to kernel taint
[ 21.488564] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 21.490140] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 21.492947] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 21.492956] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 21.492968] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 21.608880] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.104.05 Sat Aug 19 01:15:15 UTC 2023
[ 21.966325] apex 0000:06:00.0: Apex performance not throttled due to temperature
[ 22.216208] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.104.05 Sat Aug 19 00:59:57 UTC 2023
[ 22.252458] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.4)
[ 22.541322] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 22.541326] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
This sent me down a rabbit hole on the bare-metal hardware side of things, where I found several posts saying to turn on “Above 4G Decoding” in the BIOS. That’s not a setting this machine has, and I’m already running the latest (and final) BIOS for it.
Is the dmesg error above telling me that I’m SOL with this motherboard because it can’t allocate the PCI BAR address space the card is requesting? Or am I overthinking this, and it’s really a software issue? I’m comfortable giving up if it’s the motherboard, but if it’s not actually a hardware issue, I’d still like to solve it.
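In the meantime, here’s what I’m planning to run to inspect the BAR assignments directly, plus one kernel-parameter experiment I’ve seen suggested (switching pci=realloc=off to pci=realloc — that’s a guess on my part, not a known fix):

```
# Dump the PCI config for the GPU; the "Region" lines show each BAR and
# whether it actually got an address assigned:
sudo lspci -vv -s 04:00.0

# Look for BAR allocation failures from the PCI core at boot:
sudo dmesg | grep -iE 'bar|realloc'

# Experiment: let the kernel reallocate PCI resources instead of forbidding it.
# Edit /etc/default/grub, change pci=realloc=off to pci=realloc, then:
sudo update-grub && sudo reboot
```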
Thanks for any thoughts from the experts here.