NVIDIA P40; Dell Workstation; Ubuntu 20.04; Drivers don't work, Kernel errors, possible PCIe configuration problem

Hello all! The TLDR is that I’m trying to set up a personal rig with a Tesla P40 I was able to buy cheaply, for hobby AI projects (I was recently a grad student doing this research, but I chose to leave and downgrade my AI involvement to a hobby). I bought a Dell OptiPlex 7020 minitower, installed Ubuntu on it, and can see the card using lspci; however, no matter what I do, I cannot get the drivers to run, and I’m getting kernel errors. Reading up on the topic, this seems to be a common occurrence with datacenter-grade GPUs and Dell machines, specifically having to do with PCIe configuration, but the standard fixes don’t seem to work for me.

I basically have 2 questions:

  1. Is there a way to make the two play together, and if so, how?
  2. If the two have incompatible PCIe expectations, what machines are known to WORK with the Tesla P40? How can I buy a box that won’t have these issues?

More details:

Here are the specs for the Dell box: https://www.amazon.com/gp/product/B07ZDKDXRX/ref=ppx_yo_dt_b_asin_title_o05_s00?ie=UTF8&psc=1

I installed Ubuntu 20.04, and I see the P40 when I run lspci.
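For reference, this is roughly how I’ve been checking what the firmware assigned to the card (04:00.0 is just the bus address lspci reports on my box; yours may differ):

    # Show the NVIDIA device and the memory regions (BARs) assigned to it.
    lspci -nn | grep -i nvidia
    sudo lspci -vvv -s 04:00.0 | grep -i region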

From there, I followed NVIDIA’s CUDA Installation Guide for Linux.

The final tests don’t work, and nvidia-smi says it can’t detect a device. I know where the problem is: the drivers I install won’t run, no matter what. I’ve tried installing several of the available nvidia drivers, which I found by running

ubuntu-drivers devices

(The reason I know the drivers aren’t running is that the folder /proc/driver/nvidia doesn’t exist, and everything I’ve read suggests it should be there once the driver has loaded.)
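For completeness, these are the kinds of checks I’ve been using to see whether the driver actually loaded; the /proc path is the one the driver normally creates:

    # Check whether the kernel module is loaded and whether the driver registered itself.
    lsmod | grep nvidia
    ls /proc/driver/nvidia            # missing on my machine
    cat /proc/driver/nvidia/version   # would show the driver version if it were loaded
    dmesg | grep -i nvrm              # kernel messages from the probe attempt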

Specifically, the bug report below was generated while I had nvidia-driver-510-server installed, but I’ve also tried the non-server variant, 495, 470, everything.

nvidia-bug-report.log.gz (1.3 MB)
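For the record, the installs were roughly along these lines (all from the standard Ubuntu repositories, one variant at a time):

    sudo apt install nvidia-driver-510-server
    # also tried:
    # sudo apt install nvidia-driver-510
    # sudo apt install nvidia-driver-495
    # sudo apt install nvidia-driver-470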

Looking into the logs (specifically /var/log/syslog), I see a lot of the following kernel errors:

    Apr 13 17:11:33 penguins-army kernel: [    6.737243] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
    Apr 13 17:11:33 penguins-army kernel: [    6.737249] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
    Apr 13 17:11:33 penguins-army kernel: [    6.737249] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)
    Apr 13 17:11:33 penguins-army kernel: [    6.738488] nvidia: probe of 0000:04:00.0 failed with error -1
    Apr 13 17:11:33 penguins-army kernel: [    6.738508] NVRM: The NVIDIA probe routine failed for 1 device(s).
    Apr 13 17:11:33 penguins-army kernel: [    6.738509] NVRM: None of the NVIDIA devices were initialized.
    Apr 13 17:11:33 penguins-army kernel: [    6.738670] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234

Googling this error, I see the following NVIDIA developer forum posts, which suggest this is a PCIe configuration problem, though they’re talking about a different Dell box and a different GPU:

However, I can’t seem to find those settings in the BIOS for my machine. Not sure why.
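From what I’ve read, the settings in question are usually called something like “Above 4G Decoding” or “Memory Mapped I/O above 4GB”, and nothing like that shows up in this BIOS. One software-side workaround I’ve seen suggested (no idea yet whether it applies here) is asking the kernel to reassign the PCI resources at boot:

    # Possible workaround (untested here): let the kernel try to reallocate BARs the firmware left unassigned.
    # In /etc/default/grub, add pci=realloc to the kernel command line, e.g.:
    #   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
    sudo update-grub
    sudo reboot
    # Afterwards, check whether BAR1 got a real address:
    sudo lspci -vv -s 04:00.0 | grep -i region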

I also found this forum thread, which suggests that either NVIDIA or Dell frustratingly tried to protect users from themselves and deliberately made datacenter GPUs incompatible with workstations:

If that’s the case (and I’m not sure it is; the replies to that post mostly deal with overheating, which is different from my problem, since I’m not loading the GPU at all), I don’t mind buying a server instead, but I don’t want to be disappointed again; I’ve already sunk a lot of time into these issues, which are outside my usual competency. What servers are known to play well with this card? What PCIe settings can I google to check that the two would be compatible?

Thank you very much for your time.

Table 7 of this document lists P40 supported servers:

I doubt there is any deliberate intent. It’s more a case of the card being designed to fit efficiently into an environment that run-of-the-mill PC hardware doesn’t normally cater to.

In addition to larger-than-normal PCIe BAR requirements, this card has no cooling fan fitted, as it’s intended to be installed in a chassis designed to force air through it at the required rate. This can be done outside of the approved systems, but it requires a degree of hands-on work and research.


Thanks a lot for the link; while I’m still trying to fix my current system, I’ll look through it and find the cheapest servers.

As for the fan, I bought a CPU fan and added a 3D-printed air funnel :)

Do you know of any older versions of the BIOS for this machine that I can flash onto the system? Maybe there’s an alternative BIOS where these lower-level functions are exposed. I just have no idea how to search for these things.

Sorry, no.

Did similar with a Tesla K20X, but used a fairly beefy blower-style fan. I’m not sure a CPU fan will move enough air, but you’ll get an idea from monitoring temps.

For some reason the Product Brief for the P40 makes no mention of it, but the P100’s lists airflow requirements; that card also uses 250 W and is of similar construction.
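Once the driver is up, something like this is an easy way to keep an eye on it (standard nvidia-smi query options; adjust the interval as needed):

    # Poll GPU temperature and power draw every 5 seconds.
    nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv -l 5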


Yeah... I’d love for the fan to be my primary problem, especially since I know to look out for it and it has known solutions. Thanks again.

If those are really all the BIOS settings
https://www.dell.com/support/manuals/de-de/optiplex-7020-desktop/opt7020sffompub-v1/system-setup-optionen?guid=guid-212377c4-3cc4-4633-86c6-53ab926b50fc&lang=en-us
then it’s really limited for a “Workstation”. Just to make sure: are you using EFI boot, with legacy boot disabled and virtualization enabled? Please uninstall the nvidia driver and provide a dmesg output right after boot.
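Something along these lines should do it (assuming the driver packages came from the Ubuntu repositories; the output filename is just an example):

    # Remove the NVIDIA driver packages, reboot, then capture the kernel log.
    sudo apt purge '^nvidia-.*'
    sudo reboot
    # After the reboot:
    sudo dmesg > dmesg-after-boot.txt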

Hey, I returned the workstation and bought the cheapest server from rs277’s Table 7 (a Lenovo System x3650 M5), which was about the same price anyway (the market for these is weird; you can sometimes get steals). But yes, it was UEFI boot with virtualization enabled, and I think legacy boot had been disabled, though I’m not sure.

There were other issues as well; for instance, the power supplies were separate, so I turned on the power supply for the GPU (and its fan) before turning on the workstation. I’ve read that this shouldn’t be a problem, though it might have been. I’ll try again with an officially supported configuration, since it’s not much more expensive.