Troubleshooting Tesla K80 on Dell PowerEdge R810 running Ubuntu 20.04

I am attempting to install a Tesla K80 GPU in my Dell PowerEdge R810 server, running Ubuntu 20.04. After adding a breakout board for an extra PSU and getting auxiliary power to the card, I can now see the card with the lspci command:

11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Note that it appears twice, even though I only have one card installed.

However, the system does not seem to detect the card:

Running nvidia-smi results in:

No devices were found

After compiling and running the CUDA deviceQuery sample, I get:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
→ no CUDA-capable device is detected
Result = FAIL
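
(For reference, I built deviceQuery from the samples that ship with the CUDA toolkit. The exact path depends on the CUDA version, but it was roughly:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
)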

Here is the output of lspci -vvv | grep -i -A 20 nvidia

11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f7000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
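
(Note: I did not run lspci as root here, which is why the capabilities show as <access denied>. If the full capability list and BAR assignments are needed, I can rerun it as root, for example:

sudo lspci -vvv -d 10de:
)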

Here is the output of dmesg | grep NVRM

[ 77.242573] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[ 93.301821] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 93.301971] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0

(Similar errors appear to be repeated many times.)

Lastly, I ran the nvidia-bug-report.sh script and have attached the log.

Any help would be much appreciated. I am very new to all this, and hopeful I can get this GPU working on my server.

nvidia-bug-report.log (1.1 MB)

The K80 is a dual-GPU card (two GK210 GPUs on one board), so it showing up twice in lspci is normal.

NVIDIA’s sales channel for K80s and other Tesla devices consists of approved system integrators that have the necessary know-how for setting up these systems. So you are in unsupported territory, which is not the best place to be for a self-described newbie. It may turn out that your server is not capable of running a K80.

Make sure the system BIOS on this machine is updated to the latest available version. There may be issues with the large PCIe memory aperture (BAR) requirements of the K80. I vaguely recall an aperture size of 128 MB or 256 MB. Also, the memory needs to be mapped above 4 GB, as I recall. Check your system BIOS settings for the appropriate configuration options, something like “large BAR” and “above 4G decoding”.
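
One way to check from the OS side whether the kernel actually managed to map the card’s large 64-bit BARs (generic Linux commands, nothing R810-specific; adjust the bus address to match your lspci output):

sudo lspci -vv -s 11:00.0 | grep -i region
sudo dmesg | grep -iE "BAR|bridge window"

If the BIOS/kernel could not fit the large prefetchable BAR, the Region lines typically remain unassigned and dmesg tends to contain messages along the lines of “BAR 1: no space for …”, “failed to assign”, or “no compatible bridge window” for the device.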

The K80 is a passively cooled device, so make sure your system’s fans are providing adequate airflow across the K80’s heat sink. The K80 is also power hungry, so make sure it is hooked up via the required number of auxiliary PCIe power connectors and that the power supply (PSU) in the system provides sufficient wattage. BTW, what’s a “breakout board for an extra PSU”?
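
Since nvidia-smi is not working yet, one way to keep an eye on temperatures and fan speeds on a Dell server is through the BMC/iDRAC sensors, assuming ipmitool is installed and IPMI access is available:

sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan

(These read the server’s own sensors, not the GPU’s, but they at least show whether the chassis fans are ramping up.)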

nvidia-smi talks directly to the NVIDIA driver. If it does not recognize the K80, there is no point in trying further up the software stack.
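
A quick way to confirm that the driver itself is loaded (separately from the device failing to initialize):

lsmod | grep nvidia
cat /proc/driver/nvidia/version

If those look fine but nvidia-smi still reports no devices, the RmInitAdapter errors in dmesg are the place to focus, and the BIOS/BAR items above are the most likely culprits on a server of this vintage.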