I am attempting to install a Tesla K80 GPU in my Dell PowerEdge R810 server, which runs Ubuntu 20.04. After using a breakout board with an extra PSU to get auxiliary power to the card, I can now see the card in the output of lspci:
11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Note that it appears twice, even though I have only one card installed.
However, the system does not seem to detect the card. Running nvidia-smi results in:
No devices were found
Compiling and running the CUDA sample deviceQuery gives:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 100
→ no CUDA-capable device is detected
Result = FAIL
Here is the output of lspci -vvv | grep -i -A 20 nvidia:
11:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
12:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f7000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
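To make the unassigned memory regions in the output above easier to spot, here is a minimal sketch of a shell helper that filters `lspci -vvv` text read from stdin. The function name `list_unassigned_bars` is made up for illustration, and it only looks for the `<ignored>` and `<unassigned>` markers on NVIDIA devices. Whether these regions are related to the failure is an assumption on my part, not something I have confirmed.

```shell
# Hypothetical helper: print NVIDIA memory regions that lspci reports
# as <ignored> or <unassigned>. Reads `lspci -vvv` text from stdin.
list_unassigned_bars() {
    awk '
    # A device header starts with a PCI address such as 11:00.0 or 0000:11:00.0
    /^([0-9a-f]+:)?[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / {
        # Remember the address only for NVIDIA devices, otherwise clear it
        dev = (index($0, "NVIDIA") > 0) ? $1 : ""
        next
    }
    # Inside an NVIDIA device, report regions without an assigned address
    dev != "" && /Region [0-9]+: Memory at <(ignored|unassigned)>/ {
        sub(/^[ \t]+/, "")
        print dev ": " $0
    }'
}
```

It can be used as `sudo lspci -vvv | list_unassigned_bars`; no output would mean every region on the NVIDIA devices has an address.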
Here is the output of dmesg | grep NVRM:
[ 77.242573] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[ 93.301821] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 93.301971] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0
(Similar errors appear to be repeated many times.)
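Since the same errors repeat many times, a per-GPU count may be more useful than the raw log. The following is a rough sketch that reads kernel log text from stdin and counts `RmInitAdapter failed` lines per PCI address. The function name `count_rm_init_failures` is hypothetical, and the pattern is based only on the message format shown above.

```shell
# Hypothetical helper: count "RmInitAdapter failed" messages per GPU.
# Reads kernel log text (e.g. from dmesg) on stdin.
count_rm_init_failures() {
    awk '
    /NVRM: GPU .*RmInitAdapter failed/ {
        for (i = 1; i < NF; i++) {
            if ($i == "GPU") {
                addr = $(i + 1)       # e.g. "0000:11:00.0:"
                sub(/:$/, "", addr)   # drop the trailing colon
                n[addr]++
                break                 # count each log line once
            }
        }
    }
    END { for (a in n) print a, n[a] }' | sort
}
```

Example usage: `sudo dmesg | count_rm_init_failures`, which prints one line per PCI address followed by its failure count.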
Lastly, I ran the nvidia-bug-report.sh script and have attached the log.
Any help would be much appreciated. I am very new to all this, and hopeful that I can get this GPU working on my server.
nvidia-bug-report.log (1.1 MB)