Hello all! The TL;DR is that I’m trying to set up a personal rig with a Tesla P40 I was able to buy cheaply, for hobby AI projects (I was recently a grad student doing this research, but I chose to leave and downgrade my AI involvement to a hobby). I bought a Dell OptiPlex 7020 Minitower and installed Ubuntu on it, and the card shows up in lspci; however, no matter what I do, I cannot get the drivers to run, and I’m getting kernel errors. Reading up on the topic, this seems to be a common occurrence with datacenter-grade GPUs and Dell machines, specifically having to do with PCIe configuration, but the standard fixes don’t seem to work for me.
I basically have two questions:
- Is there a way to make the two play together, and if so, how?
- If the two have incompatible PCIe expectations, what machines are known to WORK with the Tesla P40? How can I buy a box that won’t have these issues?
Here are the specs for the Dell box: https://www.amazon.com/gp/product/B07ZDKDXRX/ref=ppx_yo_dt_b_asin_title_o05_s00?ie=UTF8&psc=1
I installed Ubuntu 20.04, and I see the P40 when I run lspci.
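For reference, here’s how I’ve been checking whether the card’s BARs actually got mapped, beyond just seeing the device in lspci. The “healthy” line below is hypothetical (not from my machine), just to show what a properly assigned BAR1 looks like versus an unassigned one:

```shell
# Real check on the box (bus address 0000:04:00.0 taken from the kernel log):
#   sudo lspci -vv -s 04:00.0 | grep 'Region'
# Hypothetical outputs, for comparison only:
healthy='Region 1: Memory at 383000000000 (64-bit, prefetchable) [size=32G]'
broken='Region 1: Memory at <unassigned> (64-bit, prefetchable) [virtual]'
# A mapped BAR reports a real address and size; an unmapped one shows
# <unassigned> and/or [virtual].
echo "$healthy" | grep -o '\[size=[^]]*\]'   # -> [size=32G]
```

(The 32G size is illustrative; the point is that BAR1 should show a real multi-gigabyte mapping, not 0M.)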
From there, I followed the Linux CUDA installation guide (Installation Guide Linux :: CUDA Toolkit Documentation).
The final tests don’t work, and nvidia-smi says it can’t detect a device. I know where the problem is: the drivers I install won’t load, no matter what. I’ve tried installing several of the NVIDIA driver versions listed as available.
(The reason I know the drivers aren’t loading is that the directory /proc/driver/nvidia doesn’t exist, and everything I’ve read suggests it should be created once the driver loads.)
Specifically, the bug report below was generated while I had nvidia-driver-510-server installed, but I’ve tried the non-server option, the server option, 495, 470, everything.
nvidia-bug-report.log.gz (1.3 MB)
Looking into the logs (specifically /var/log/syslog), I see a lot of the following kernel errors:
Apr 13 17:11:33 penguins-army kernel: [    6.737243] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Apr 13 17:11:33 penguins-army kernel: [    6.737249] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Apr 13 17:11:33 penguins-army kernel: [    6.737249] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)
Apr 13 17:11:33 penguins-army kernel: [    6.738488] nvidia: probe of 0000:04:00.0 failed with error -1
Apr 13 17:11:33 penguins-army kernel: [    6.738508] NVRM: The NVIDIA probe routine failed for 1 device(s).
Apr 13 17:11:33 penguins-army kernel: [    6.738509] NVRM: None of the NVIDIA devices were initialized.
Apr 13 17:11:33 penguins-army kernel: [    6.738670] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
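From what I’ve read, “BAR1 is 0M @ 0x0” means the firmware never assigned the card’s large BAR1 aperture (the window that maps GPU memory), typically because the BIOS lacks or disables an option like “Above 4G Decoding” / “Memory Mapped I/O above 4GB”. One software-side workaround that keeps coming up (one of the standard fixes I mentioned trying) is asking the kernel to redo the firmware’s PCI resource assignments via a boot parameter; a sketch of the change to /etc/default/grub, assuming the stock Ubuntu defaults:

```shell
# /etc/default/grub -- add pci=realloc so the kernel reassigns PCI BARs at
# boot instead of trusting the firmware's layout. This only helps if the
# chipset can actually address 64-bit MMIO; "quiet splash" is just the
# stock Ubuntu default.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
```

followed by `sudo update-grub` and a reboot. On my box this didn’t make the BAR1 error go away.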
Googling this error, I see the following NVIDIA developer forum posts, which suggest this is a PCIe configuration problem, though it’s talking about a different Dell box and a different GPU:
However, I can’t seem to find those settings in the BIOS for my machine. Not sure why.
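In case it helps anyone debug the same thing: even without the BIOS menu entry, the PCI memory windows the firmware hands to Linux are visible in /proc/iomem, so you can at least tell whether any 64-bit MMIO window exists. A sketch (the sample address range is hypothetical, showing what a box that supports above-4G decoding would report):

```shell
# On the real box:
#   sudo grep -i 'pci bus' /proc/iomem
# If every PCI window sits below 0x100000000 (4 GiB), the firmware never
# enabled above-4G decoding, so a multi-gigabyte BAR1 can never be placed.
sample='380000000000-383fffffffff : PCI Bus 0000:04'   # hypothetical 64-bit window
start=${sample%%-*}                                    # hex start of the window
if [ "$((0x$start))" -ge 4294967296 ]; then            # compare against 4 GiB
  echo "64-bit MMIO window present"
fi
```

On a machine where the highest PCI window tops out below 4 GiB, no driver version or kernel parameter will rescue the card.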
I also found this forum post, which suggests that either NVIDIA or Dell, frustratingly, tried to protect users from themselves and deliberately made datacenter GPUs incompatible with workstations:
If that’s the case (and I’m not sure it is; the replies to that post deal with overheating, which is different from my problem, since I’m not loading the GPU at all), I don’t mind buying a server, but I don’t want to be disappointed again; I’ve already sunk a lot of time into these issues, which are outside my usual competency. What servers are known to play well with this card? What PCIe settings can I search for to check that a machine and the card would be compatible?
Thank you very much for your time.