Driver Installation for Tesla K80 - Problems

The problem: I cannot get the Tesla + Quadro drivers installed individually; the cards do not want to work together. I tried two operating systems (Windows 8.1 and Windows Server 2016), different drivers (412, 425, 431), and different CUDA packages.
The system freezes upon initialization of the second Tesla chip, the computer restarts, and after that the system only boots in safe mode.

Motherboard: Supermicro X11SPI-TF (Socket 3647)
CPU: Xeon Platinum 8176
OS: Microsoft Windows Server 2016 (10.00.14393), x64




The driver included in any recent CUDA installer will work with both GPUs.

If it were me, I wouldn’t try to hand-pick an individual driver such as 412, 425, or 431. Instead, use a CUDA installer package.

And you only need one driver for those GPUs. Don’t try to install two different drivers.

You also need to make sure that the Tesla K80 is adequately cooled and that the platform (supermicro) supports the K80.

Robert, this surprises me. I’m stuck at the same point. The latest as of today, the 440.44, doesn’t support Tesla, according to the “supported products” tab. Likewise, the latest as of today that supports Tesla, 440.33.01, doesn’t support GTX, according to the “supported products” tab.

So can I have GTX (in the future: RTX) and K80 on the same computer? If so, how?

Thanks!

The driver shipped with the CUDA toolkit installer supports all GPUs that were available when that installer was created. Therefore, the driver bundled with the CUDA 10.2 installer supports all GPUs (GeForce, Quadro, and Tesla) that were available when CUDA 10.2 was released. This certainly includes the K80 as well as all GeForce GTX GPUs.

Thanks for your fast reply.

But now there’s something else I don’t understand. I found that the driver that ships with the CUDA toolkit doesn’t work for me; I always need to go to https://www.nvidia.com/Download/index.aspx?lang=en-us and use the driver downloaded from there. At least, nvidia-smi doesn’t work with the CUDA-supplied driver. The only way I can get nvidia-smi to work is by using the .run file downloaded from that URL.

To be fair, I haven’t tried it again in recent years (I found all that .rpm stuff never really worked as far as CUDA / NVIDIA is concerned, the people on #rpmfusion were more defensive than constructive, and since I started using the .run files everything has worked like a charm).

I also didn’t try the 440.44 on the K80; perhaps it does actually work, even though it’s not listed under “supported products” at the URL above. I have the same problem that others here have: I can’t get the K80 to work in a workstation. This https://www.ebay.com/itm/GPU-Cooler-with-Quiet-Fan-for-Nvidia-Tesla-K80-M40-Passive-Cooling-Model-B/123949246966 seemed a perfect cooling solution. But I have power connector problems (6-pin / 8-pin problems); I can’t even boot and make that 4G change in the BIOS. So I can’t even test whether the 440.44 does the trick. It might (but then, why doesn’t “supported products” list it?).

I know you try to discourage people from using the K80 in a workstation; I’ve read this like a dozen times here. I still don’t want to give up yet. The custom cooling solution seems to solve that problem. Once I figure out the power supply connector problem (it’s an unusual socket, normally found on a mobo, not on a graphics card, and I can’t find any cabling for it), I can make the 4G change in the BIOS. One of three problems solved, two to go.

Because if I can get this to work, it would be really sweet. These K80s are getting cheaper and cheaper, and I could augment each of my computers with a K80 for little money. As a CUDA nerd, this would totally make my day: one GPU from the GTX, another two GPUs from the K80, and then cudaSetDevice() in my CUDA programs.

You should be able to make the 4G change in the BIOS without the K80 plugged in.

Yes, and I did, but that still doesn’t help if I cannot boot because of power connector problems. I need to solve that first. If the K80 is not connected to power via the 8-pin connector, I can boot just fine, but then the card isn’t there. It just sits idle in the PCIe slot. The card isn’t really “on”: it doesn’t get warm, there’s some LED inside that lights up, but that’s it. The power through the PCIe slot doesn’t seem to be enough; it definitely needs the 8-pin power to even turn “on”. So without the power connector problem solved, I make no progress.

Per specification, the PCIe slot can supply at most 75W. In many common GPU designs, the PCIe slot only supplies about 40W. For additional power, there are auxiliary PCIe power connectors, with each 6-pin connector able to supply up to 75W, and each 8-pin connector able to supply 150W.

Each PCIe auxiliary power connector on a GPU must be hooked up to an appropriate cable from the PSU (power supply unit) for the GPU to work correctly. Avoid converters (e.g. 6-pin to 8-pin, molex to 6-pin) in the supply cables, as well as daisy chaining (multiple connectors on the same cable). Make sure that your PSU has sufficient wattage to drive GPUs, CPUs, and other system components.
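To put those figures together (a quick sketch; the 75 W and 150 W values are the spec limits quoted above, and a card’s actual requirement is on its spec sheet, so treat this as an upper bound, not a guarantee):

```shell
# Worst-case power available to a card, using the spec limits quoted above
slot_w=75        # PCIe slot (spec maximum)
aux8_w=150       # one 8-pin auxiliary connector
echo "available: $((slot_w + aux8_w)) W"
# -> available: 225 W
```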

OK, progress. Instead of all that cable fumbling and buying cables on eBay and Amazon that don’t work, I bought a 600W power supply for less than $20 and soldered the power cables myself. I can now boot, and lspci shows the K80. In fact, it recognizes that there are two GPUs on it. With lspci I get:

06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

But it shows up neither in nvidia-smi nor in nvidia-settings.

Regarding the 4G BIOS setting, I can’t find it in my BIOS. Does that mean the change is unnecessary? If it is necessary, would its absence prevent the boot? Or does the fact that lspci shows the card prove that the BIOS change is not needed?

Do I have to “register” it somehow? Given that lspci sees it, what do I have to do next so that the K80 becomes available?

Remember, several people said that the 440.44 driver would be sufficient, and that’s the driver I use. But I still can’t see the K80 in nvidia-smi or nvidia-settings, only in lspci.

On another page I saw something about adding an option to the kernel command line?

Do I have to rerun the driver installer?

Thanks!

So, rerunning the 440.44 installer didn’t help. Same problem: lspci shows the K80, but neither nvidia-smi nor nvidia-settings does. It’s properly cooled and properly powered; now I’m asking for help to actually get it to work. The card is clearly “running”, because lspci sees it.

Are you guys sure that the 440.44 driver supports Tesla? Because, as I mentioned above, in the “supported products” list for the 440.44, it does NOT list Tesla, and when I search for the latest K80 driver, it says 440.33.01, not 440.44.

I never said anything about 440.44 driver.

I already indicated which driver I would start with for a mixed GTX/Tesla config.

What is the output of the following commands:

dmesg |grep NVRM

(and)

lspci -vvv |grep -A 20 -i nvidia

[ 12.580230] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:06:00.0)
[ 12.580939] NVRM: The system BIOS may have misconfigured your GPU.
[ 12.581814] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:07:00.0)
[ 12.582530] NVRM: The system BIOS may have misconfigured your GPU.
[ 12.800501] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 12.800790] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.44 Sun Dec 8 03:38:56 UTC 2019

and

06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation Device 106c
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 39
NUMA node: 0
Region 0: Memory at d9000000 (32-bit, non-prefetchable)
Region 1: Memory at <unassigned> (64-bit, prefetchable)
Region 3: Memory at d2000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported

    Kernel modules: nouveau, nvidia_drm, nvidia

07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Subsystem: NVIDIA Corporation Device 106c
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 39
NUMA node: 0
Region 0: Memory at da000000 (32-bit, non-prefetchable)
Region 1: Memory at <unassigned> (64-bit, prefetchable)
Region 3: Memory at d4000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #16, Speed 8GT/s, Width x16, ASPM not supported

    Kernel modules: nouveau, nvidia_drm, nvidia

0e:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6315 Series Firewire Controller (rev 01) (prog-if 10 [OHCI])
Subsystem: VIA Technologies, Inc. VT6315 Series Firewire Controller
Physical Slot: 4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 122
NUMA node: 0
Region 0: Memory at dbff0000 (64-bit, non-prefetchable)
Region 2: I/O ports at 6000
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [98] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W

41:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 980] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 2983
Physical Slot: 6
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 165
NUMA node: 1
Region 0: Memory at fb000000 (32-bit, non-prefetchable)
Region 1: Memory at e0000000 (64-bit, prefetchable)
Region 3: Memory at de000000 (64-bit, prefetchable)
Region 5: I/O ports at 8000
[virtual] Expansion ROM at 000c0000 [disabled]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00318 Data: 0000
Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-

    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia

41:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
Subsystem: eVga.com. Corp. Device 2983
Physical Slot: 6
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 143
NUMA node: 1
Region 0: Memory at faff0000 (32-bit, non-prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-

I found another thread in which you pointed at the NVRM messages and the “unassigned” problem in lspci -vvv. The last post in that thread said that the 4G option in the BIOS corrupted the overall system, and that the kernel option pci=nocrs,noearly was enough to get the BARs registered correctly. I’ll try that next.
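In case someone else needs it, here is a minimal sketch of making that kernel option persistent on a GRUB2-based distro. It is demonstrated on a scratch file; the real file is /etc/default/grub, and the regeneration command and config path vary by distro and EFI vs. BIOS boot, so treat those as assumptions to verify:

```shell
# Sketch on a scratch file; apply the same edit to /etc/default/grub on the
# real system, then regenerate the GRUB config.
echo 'GRUB_CMDLINE_LINUX="quiet"' > /tmp/grub-test

# Append the PCI workaround to the kernel command line
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 pci=nocrs,noearly"/' /tmp/grub-test

grep ^GRUB_CMDLINE_LINUX /tmp/grub-test
# -> GRUB_CMDLINE_LINUX="quiet pci=nocrs,noearly"

# On the real system afterwards (command and path vary by distro / EFI):
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
```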

Yes, you have an incompatibility between your system BIOS and this K80 GPU. There is nothing NVIDIA can do about that.

This is why Tesla products are usually sold in certified systems, and only recommended for usage that way. There is no general statement that you can plug a Tesla card in any system you want and it should just work.

https://www.nvidia.com/en-us/data-center/tesla/tesla-qualified-servers-catalog/

If you want to try and make this work, I suggest making sure you are using an up-to-date BIOS for your motherboard, and play with BIOS settings if needed to see if any affect the problem or system behavior. It may be that your kernel config parameters may help, I don’t know. I can’t offer any suggestions about your specific BIOS or setup, and really can’t help any further. You’re welcome to ask additional questions if you wish, but I’m unlikely to be able to respond. Perhaps someone else will chime in. Good luck.


I totally understand what you’re saying; I was aware from the beginning that this was going to be tricky, and I thank you for your past support. Your guidance (also in other threads on this matter) has been extremely helpful.

Almost there. Just before switching to the graphical login after all those boot messages, the system drops into rescue mode and leaves the screens black. In rescue mode I can now run nvidia-smi as root and get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   58C    P0    52W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   42C    P0    69W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980     Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P0    46W / 185W |      0MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

dmesg | grep NVRM seems to indicate the kernel module loaded successfully:

[ 14.958871] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.33.01 Wed Nov 13 00:00:22 UTC 2019

and shows no more BIOS errors.

Also, lspci -vvv | grep -A 20 -i nvidia no longer shows “unassigned” in the Regions.

Is it possible that the system enumerates the GPUs in the “wrong order”, i.e., the GPUs from the K80 first, so that the GTX becomes device id 2? Is there any way to make the GTX device id 0? I have the sneaking suspicion that the only problem is that it tries to use device 0, i.e., a GPU from the K80, for the display, but those have no display outputs, so I get dropped into rescue mode before the GTX can do anything. Just a guess. If that’s the reason, how can I “rearrange” the order of the devices? I want the GTX to be device id 0.

Maybe some PCI settings in the BIOS?

Is that done by the PCI slot number? Like slot 1 first, then slot 2, etc.?

Thanks!

A quick 5-minute experiment (swap GTX and K80 between the PCIe slots) should tell you whether enumeration is by physical slot.
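Besides swapping slots, CUDA’s own enumeration order can be steered from the environment. Note this only affects what CUDA programs see, not which GPU X.org drives for the display; the sketch below uses the documented CUDA_DEVICE_ORDER and CUDA_VISIBLE_DEVICES variables, and my_cuda_program is a placeholder name:

```shell
# Enumerate in PCI bus order (matches nvidia-smi) instead of "fastest first"
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Reorder/restrict what a CUDA program sees; with PCI_BUS_ID set, these
# indices follow nvidia-smi's ordering, so "2" here would be the GTX
export CUDA_VISIBLE_DEVICES=2,0,1

# ./my_cuda_program   # placeholder: inside it, cudaSetDevice(0) is now the GTX
```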

Thanks njuffa and Robert, it works now. In fact, I had two of them, so I bought two of those cooling fans and soldered my own cabling from the PCB to the external power supply. nvidia-settings and nvidia-smi both show all the cards now:

Sat Jan 18 11:11:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 00000000:04:00.0  On |                  N/A |
|  0%   58C    P0    47W / 185W |    163MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:0C:00.0 Off |                    0 |
| N/A   50C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:0D:00.0 Off |                    0 |
| N/A   38C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:43:00.0 Off |                    0 |
| N/A   43C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:44:00.0 Off |                    0 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1793      G   /usr/libexec/Xorg                             33MiB |
|    0      2123      G   /usr/bin/gnome-shell                          47MiB |
|    0      2473      G   /usr/libexec/Xorg                             77MiB |
+-----------------------------------------------------------------------------+

At the end of the day, though, I’m underwhelmed by the performance. My GTX 980 is still faster – not only in compute, but also in memcopies and throughput. And the K80 draws more power for the work done: its two GPUs together draw about as much power as the GTX. So I think for quick computations I’ll pick the GTX, and use the two K80s (four GPUs) only for long-running background tasks for some AI code I’m writing in CUDA. They also have much more memory: the 980 only has 4 GB, while the GPUs on a K80 have 12 GB each.

And they have ECC, unlike the GTX, but I don’t think that’s important (glad to hear disagreeing opinions).

One more question: what is the execution timeout? It’s 1 for the GTX 980 and 0 for the four GPUs on the K80. Thanks!

Keep in mind that the K80 represents five year old technology. Other than a larger memory (which is important for some applications), what K80 offers in spades that the GTX 980 has very little of is double-precision performance.

Transient memory errors due to cosmic radiation happen, at about the rate of 1 bit error per GB per year (assuming around the clock operation). If you have a cluster with a hundred GPUs, you will encounter several of these per day. In some use cases, a flipped bit is of no consequence, in others it may lead to invalid results (as the error propagates through the entire computation). In the worst case, it leads to disaster.
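Putting rough numbers on that (a back-of-the-envelope sketch using the 1 bit error / GB / year rate quoted above, and assuming 12 GB per GPU as on each K80 GPU, running around the clock):

```shell
# Expected transient bit errors for a 100-GPU cluster,
# at 1 error per GB per year and 12 GB per GPU
awk 'BEGIN { gpus = 100; gb = 12; rate = 1.0
             per_year = gpus * gb * rate
             printf "%.0f errors/year, %.1f errors/day\n", per_year, per_year / 365 }'
# -> 1200 errors/year, 3.3 errors/day
```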