nvidia-smi fails to load drivers on Linux

Hi,

I have an NVIDIA Tesla M40 on a machine running Linux Mint 20.3.
I’ve tried installing every available driver from apt. My latest clean install was:

sudo apt-get install nvidia-driver-510

I rebooted the OS afterwards, but whenever I run nvidia-smi I always get:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

So I ran nvidia-bug-report.sh and attached the result here:

nvidia-bug-report.log.gz (398.4 KB)

I have the same issue; I think it started failing after the recent live update. My graphics card is an NVIDIA 3070 Dual.

I recently changed my graphics card, so maybe there is some conflict that I didn’t manage to resolve 😅

For the Tesla to work, you need to enable “Above 4G decoding” in the BIOS, disable CSM, and boot in EFI mode. Also, Teslas are built for servers and don’t have a fan, so you need to add one if you’re running the card in a desktop case.
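
The EFI part can be checked from the running system: the kernel only exposes /sys/firmware/efi when it was booted through UEFI. A minimal sketch, where the boot_mode helper and its optional directory argument are made up for illustration:

```shell
#!/bin/sh
# boot_mode is a hypothetical helper, not part of any NVIDIA or distro tool.
# $1: firmware sysfs directory (optional, defaults to /sys/firmware).
boot_mode() {
    if [ -d "${1:-/sys/firmware}/efi" ]; then
        echo "UEFI"
    else
        echo "legacy BIOS or CSM"
    fi
}

boot_mode
```

If this prints "legacy BIOS or CSM", the installed system was started the old way, regardless of what the firmware setup screen says.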

Thanks for the tip,

I’ll try changing the BIOS settings.

About the fan: I’m printing an adaptor to mount on one end of the card. I hope it will help carry the heat away!

Update: I’ve changed all the BIOS parameters, but it still doesn’t work…

Is there a way to tell whether the card itself has a problem? I can see that the card is inserted!

Please uninstall the driver and attach a dmesg output right after reboot.

Hello @generix, I tried to reply to you on another topic: /dev/sdb1 : clean, 640729/122388848… and Keyboard is not working - #17 by generix

However, I could not reply because new users are limited to 3 replies per topic. I edited my previous post on that topic to attach the bug report: /dev/sdb1 : clean, 640729/122388848… and Keyboard is not working - #16 by abdulbaasitsanusi

Is it possible to continue the discussion elsewhere?

Generix, sorry for the late reply…

Here is the output of dmesg:
dmesg.out (184.8 KB)

64-bit resources are still not enabled. Is this a plain old BIOS or a UEFI with CSM enabled?
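
For anyone reading their own dmesg for the same symptom, a rough filter like the one below can pull out the relevant lines. It is only a sketch: bar_failures is a made-up helper name, and the patterns are assumptions based on commonly seen kernel and NVRM messages, whose exact wording changes between kernel and driver versions.

```shell
#!/bin/sh
# bar_failures is a hypothetical helper: it reads kernel log text on stdin and
# prints only the lines that look like failed PCI BAR assignments.
# The patterns are assumptions; message wording varies by kernel/driver version.
bar_failures() {
    grep -i -E "BAR ?[0-9].*(no space|failed to assign|can't assign)|NVRM:.*BAR"
}

# Typical use on the affected machine:
#   sudo dmesg | bar_failures
```

No output does not prove the BARs are fine; it only means none of these particular messages were logged.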

Hmm, strange. I’m running a 64-bit Linux system, or am I confusing things?

The BIOS was just updated to the latest version; it supports UEFI and has CSM disabled.
The motherboard is pretty old (an Asus board from 2012).

If this is a UEFI board, then you still have CSM enabled, because the Linux install uses an MBR boot. So you will also have to reinstall Linux after really disabling CSM.
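
One way to see how the disk was set up is to look at its partition table type. The sketch below labels the output of lsblk; describe_pttype is a made-up helper name, and the mapping is only a rule of thumb, since a GPT disk can in principle be booted in legacy mode and the reverse.

```shell
#!/bin/sh
# describe_pttype is a hypothetical helper: it reads "NAME PTTYPE" pairs on
# stdin (as printed by lsblk) and labels each disk's partition table.
describe_pttype() {
    while read -r name pttype; do
        case "$pttype" in
            gpt) echo "$name: GPT (what a UEFI install normally uses)" ;;
            dos) echo "$name: MBR (what a legacy or CSM install normally uses)" ;;
            *)   echo "$name: ${pttype:-no partition table}" ;;
        esac
    done
}

# Typical use:
#   lsblk -dno NAME,PTTYPE | describe_pttype
```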

64-bit resources have nothing to do with the OS; they are provided by the BIOS (after enabling “Above 4G decoding / 64-bit BARs”).
The CSM in modern UEFI firmware is very limited and not capable of much.

So it may be that the board is not capable of Above 4G decoding. That sounds strange to me, though, because I previously had an RTX 3060 installed in the same machine…

Teslas want to map their whole video memory into the system address range, and 24 GB obviously needs 64-bit address space.
Normal graphics cards like the 3060 only map 256 MB (unless the BIOS supports Resizable BAR).
Does the BIOS have a 4G option?
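
Whether the card’s memory regions actually received addresses can be read from lspci, which prints <unassigned> for a BAR that has none. A small sketch that filters for those lines (unassigned_regions is a made-up helper name, and the pattern assumes the usual `lspci -vv` layout):

```shell
#!/bin/sh
# unassigned_regions is a hypothetical helper: it reads `lspci -vv` text on
# stdin and prints the memory regions (BARs) that never received an address.
# The pattern assumes the usual "Region N: Memory at ..." layout of lspci.
unassigned_regions() {
    grep -E 'Region [0-9]+: Memory at <(unassigned|ignored)>'
}

# Typical use (10de is NVIDIA's PCI vendor ID):
#   sudo lspci -vv -d 10de: | unassigned_regions
```

On a working setup this prints nothing, and the large prefetchable region shows a real address and size instead.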

Unfortunately no, I didn’t find that option… I should test the M40 on a board with that option…

I checked the board’s manual, and this looks like some very early UEFI/BIOS hybrid. It doesn’t support any Tesla; no dice.

I finally managed to test the GPU on a newer motherboard and IT WORKS.

Thanks for guiding me towards the solution!

The issue isn’t always that the motherboard lacks support for large PCI address spaces. Sometimes, the BIOS is technically capable of handling it but fails to properly assign the correct memory regions. This can happen due to firmware bugs, misconfigured PCIe resource allocation, or simply because the BIOS doesn’t know what to do with certain devices. As a result, the GPU gets assigned invalid or unusable memory addresses, making it non-functional.

Luckily, Linux can often work around this faulty allocation. Adding pci=nocrs to the kernel boot parameters tells the kernel to ignore the PCI host bridge windows reported by the firmware through ACPI and to handle PCI resource allocation on its own, which often fixes the issue.
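
As a rough sketch of how the parameter is usually added on a Debian-style system such as Mint (the file path and update-grub are the standard ones there; other distros differ), plus a small check that it took effect after the reboot. The has_kernel_param helper is made up for illustration.

```shell
#!/bin/sh
# In /etc/default/grub, append the parameter to the default command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nocrs"
# then regenerate the GRUB config and reboot:
#   sudo update-grub

# has_kernel_param is a hypothetical helper: it succeeds if the given
# parameter appears on the kernel command line.
# $1: parameter, $2: file holding the command line (defaults to /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if has_kernel_param pci=nocrs; then
    echo "pci=nocrs is active"
else
    echo "pci=nocrs is not on the kernel command line"
fi
```

For a one-off test, the same parameter can be typed into the kernel line from the GRUB menu before booting, which avoids changing the config until it is known to help.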

I successfully got the Tesla M40 (2015) running on 2013 hardware, even though it initially refused to work at all. The issue was with the PCIe memory assignments: BAR2 and BAR3 weren’t properly allocated. After tweaking the PCI address space and ensuring the correct BAR sizes, the card is now fully functional and running perfectly on a 2013 motherboard with extremely outdated firmware.

Hello, I am trying to run the M40 in a 20-year-old server, but I have hit a problem. The server cannot detect the M40 compute card, yet when I replace it with a regular graphics card, that card is detected. Do you have any suggestions for getting the server to detect the M40? My server is an HP DL380 Gen6, and the graphics card is a Tesla M40. Thank you very much.