RTX 4000 SFF for inference only

nvidia-bug-report.log.gz (122.4 KB)

I am trying to get an RTX 4000 SFF running in an industrial PC. I have been using a T4 with success. With the RTX 4000 SFF, however, nvidia-smi reports “No devices were found”.

I am using this GPU for image inference only, so I would prefer that it does not handle any graphics (display) work. In the BIOS of this IPC I was able to select the onboard graphics card for display.

The GPU does appear when I run `lspci`:
01:00.0 3D controller: NVIDIA Corporation Device 27b0 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22bc (rev a1)

Hi @thomas.p.16,

Please proceed with a clean system install or at least a clean GPU driver install. You have corrupted driver “parts” in your installation which only a complete driver purge or ideally a clean system install will fix.
And please use EITHER the distro installation method OR the .run file OR the CUDA-packaged driver. Don’t mix and match.

Some indicators in the log:

Feb 29 17:36:50 nuvo kernel: [  240.784271] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
Feb 29 17:36:50 nuvo kernel: [  242.306188] NVRM: API mismatch: the client has the version 545.23.08, but
Feb 29 17:36:50 nuvo kernel: [  242.306188] NVRM: this kernel module has the version 535.161.07.  Please
Feb 29 17:36:50 nuvo kernel: [  242.306188] NVRM: make sure that this kernel module and all NVIDIA driver
Feb 29 17:36:50 nuvo kernel: [  242.306188] NVRM: components have the same version.

and

-> Installing NVIDIA driver version 550.54.14.
-> Initramfs scan complete.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.

Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:

The package that is already installed is named nvidia-driver-545.

You can upgrade the driver by running:
`apt-get install nvidia-driver-545`

You can remove nvidia-driver-545, and all related packages, by running:
`apt-get remove --purge nvidia-driver-545`
`apt-get autoremove`

This package is maintained by NVIDIA (cudatools@nvidia.com).


(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation.

Thanks.
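If it helps with checking this on your side, here is a rough sketch that pulls the two conflicting version numbers out of NVRM lines like the ones quoted above. The helper name `nvrm_versions` is made up for illustration, and it assumes the mismatch message has the same wording as in your log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and report the
# user-space (client) and kernel module versions from an NVRM API mismatch.
nvrm_versions() {
    awk '
        # Return the field following the word "version", minus trailing punctuation.
        function after_version(   i, v) {
            for (i = 1; i < NF; i++)
                if ($i == "version") { v = $(i + 1); sub(/[.,]$/, "", v); return v }
            return ""
        }
        /NVRM: API mismatch: the client has the version/ { client = after_version() }
        /NVRM: this kernel module has the version/       { module = after_version() }
        END { if (client != "" && module != "") print "client=" client " kernel=" module }
    '
}

# Usage on a live system (prints nothing if no mismatch was logged):
#   sudo dmesg | nvrm_versions
#   journalctl -k | nvrm_versions
```

If the two numbers differ, the user-space libraries and the loaded kernel module come from different driver installs.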

This resulted in:
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 545.23

`apt-get remove --purge nvidia-driver-545`
`apt-get autoremove`

I ran this again, with the same results.
apt-get install nvidia-driver-545
Rebooting now.

After rebooting I am back to “No devices were found”.
As for trying a clean system, this is already my second try.

The debug log showed indicators that you have both v535 and v545 drivers installed.
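One way to see this on the package side is to list which driver branches are currently installed. The following is only a sketch for a Debian/Ubuntu system: `nvidia_pkg_branches` is a hypothetical helper name, and it assumes the packages carry the branch number as a three-digit suffix in their name (as `nvidia-driver-545` does).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dpkg -l` style output on stdin and print the
# distinct driver branches of installed NVIDIA packages, one per line.
# Only fully installed packages (status "ii") are counted.
nvidia_pkg_branches() {
    awk '$1 == "ii" && $2 ~ /nvidia/ && match($2, /-[0-9][0-9][0-9]/) {
             print substr($2, RSTART + 1, 3)
         }' | sort -u
}

# Usage on a live system:
#   dpkg -l | nvidia_pkg_branches
```

More than one line of output would suggest that packages from several driver branches are installed side by side.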

If you start with a clean install, do NOT install any GPU drivers during operating system installation.

In fact, since you want to use CUDA, I highly recommend following the detailed CUDA installation instructions to the letter.

nvidia-bug-report.log.gz (108.3 KB)

This is from a fresh install, following the detailed CUDA installation instructions.
There was one exception: in order to use nvidia-smi I also had to run
sudo apt install nvidia-utils-535

Ok, the GPU somehow still cannot be initialized but I am not completely sure why yet.

I assume that the GPU is seated correctly in its slot and that there is sufficient power supply as well as cooling in place.
The log output

Model: 		 NVIDIA RTX 4000 SFF Ada Generation
IRQ:   		 131
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??

is not really typical.

Also, normally you don’t need the utils installation to get nvidia-smi; it is part of the CUDA installation.
Did you reboot the system after all installation steps?

Then it does look a bit like the nouveau driver is still loaded, which would cause this. Can you check with

lsmod | grep nouveau

whether it is and, if so, blacklist it, for example with

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

After which you need to do

sudo update-initramfs -u

and then reboot the system.
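After the reboot you could confirm that the blacklist took effect with something like the sketch below. `module_loaded` is a hypothetical helper name; it only inspects `lsmod` output and changes nothing on the system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the module named in $1 appears
# in `lsmod` style output on stdin, fail (exit 1) otherwise.
module_loaded() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   lsmod | module_loaded nouveau && echo "nouveau is still loaded"
#   lsmod | module_loaded nvidia  || echo "nvidia module is not loaded"
```

Matching on the exact module name avoids false hits from other modules that merely list nouveau in their "Used by" column.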

I hope that helps.