Troubleshooting Tesla K80 on Dell PowerEdge R810 running Ubuntu 20.04

I am attempting to install a Tesla K80 GPU in my Dell PowerEdge R810 server, running Ubuntu 20.04. After adding a breakout board for an extra PSU and getting auxiliary power to the card, I can now see the card with the lspci command:

11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Note that it appears twice, even though I only have one card installed.

However, the system does not seem to detect the card:

Running nvidia-smi results in:

No devices were found

After compiling and running the CUDA deviceQuery sample, I get:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
→ no CUDA-capable device is detected
Result = FAIL
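
(For reference, I built deviceQuery from the samples that ship with the CUDA toolkit. The exact path depends on the CUDA version, but it was roughly:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
)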

Here is the output of lspci -vvv | grep -i -A 20 nvidia

11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f7000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
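
(Note: I did not run lspci as root here, which is why the capabilities show as <access denied>. If the full capability list and BAR assignments are needed, I can rerun it as root, for example:

sudo lspci -vvv -d 10de:
)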

Here is the output of dmesg | grep NVRM

[ 77.242573] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[ 93.301821] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 93.301971] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0

(Similar errors appear to be repeated many times.)

Lastly, I ran the nvidia-bug-report.sh script and have attached the log.

Any help would be much appreciated. I am very new to all this, and hopeful I can get this GPU working on my server.

nvidia-bug-report.log (1.1 MB)

The K80 is a dual-GPU card (two GK210 GPUs on one board), so it showing up twice in lspci is normal.

NVIDIA’s sales channel for K80s and other Tesla devices consists of approved system integrators that have the necessary know-how for setting up these systems. So you are in unsupported territory, which is not the best place to be for a self-described newbie. It may turn out that your server is not capable of running a K80.

Make sure the system BIOS on this machine is updated to the latest available version. There may be issues with the large PCIe memory aperture (BAR) requirements of the K80. I vaguely recall an aperture size of 128 MB or 256 MB. Also, the memory needs to be mapped above 4 GB, as I recall. Check your system BIOS settings for the appropriate configuration options, something like “large BAR” and “above 4G decoding”.
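
One way to check from the OS side whether the kernel actually managed to map the card’s large 64-bit BARs (generic Linux commands, nothing R810-specific; adjust the bus address to match your lspci output):

sudo lspci -vv -s 11:00.0 | grep -i region
sudo dmesg | grep -iE "BAR|bridge window"

If the BIOS/kernel could not fit the large prefetchable BAR, the Region lines typically remain unassigned and dmesg tends to contain messages along the lines of “BAR 1: no space for …”, “failed to assign”, or “no compatible bridge window” for the device.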

The K80 is a passively cooled device, so make sure your system’s fans are providing adequate airflow across the K80’s heat sink. The K80 is also power hungry, so make sure it is hooked up via the required number of auxiliary PCIe power connectors and that the power supply (PSU) in the system provides sufficient wattage. BTW, what’s a “breakout board for an extra PSU”?
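
Since nvidia-smi is not working yet, one way to keep an eye on temperatures and fan speeds on a Dell server is through the BMC/iDRAC sensors, assuming ipmitool is installed and IPMI access is available:

sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan

(These read the server’s own sensors, not the GPU’s, but they at least show whether the chassis fans are ramping up.)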

nvidia-smi talks directly to the NVIDIA driver. If it does not recognize the K80, there is no point in trying further up the software stack.
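
A quick way to confirm that the driver itself is loaded (separately from the device failing to initialize):

lsmod | grep nvidia
cat /proc/driver/nvidia/version

If those look fine but nvidia-smi still reports no devices, the RmInitAdapter errors in dmesg are the place to focus, and the BIOS/BAR items above are the most likely culprits on a server of this vintage.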