Plugging in Tesla K80 results in PCI resource allocation error

Hi,

I bought a Tesla K80 card and tried to integrate it into a workstation PC (of course with sufficient ventilation). I already tried the card in combination with 3 different motherboards but no combination worked. The error I get from my Asus Rampage IV is:

D4: PCI resource allocation error

The system does not start up with the K80 plugged in. After unplugging it, everything runs fine. I already removed every other PCI card, but that did not make any difference.
On another motherboard (I do not have the exact model yet) the system started, but after installing the driver in Windows the Device Manager states: Insufficient resources available. Unplug another component to make it work. (free translation from German).
The third motherboard did not even start up; the power-good signal never seemed to arrive.

I tried to find some answers in these forums, unfortunately without success.

Any help appreciated!

This is to be expected on most workstation or consumer motherboards.

K80 (and K40) have large PCI BAR regions that need to be mapped into system PCI space. Many system BIOSes choke on this process, and it is one of the reasons that the normal way to deploy Tesla products is in an OEM-qualified server.
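To illustrate why firmware that only maps device regions below the 4 GiB boundary can run out of room, here is a rough back-of-the-envelope check in Python. The BAR sizes and the size of the 32-bit MMIO window are hypothetical placeholders chosen for illustration, not measured K80 or motherboard values, and `fits_in_window` is a made-up helper name.

```python
GIB = 1 << 30
MIB = 1 << 20

def fits_in_window(bar_sizes, window_bytes):
    """Return True if the BARs could, at best, fit inside the window.

    This is only a necessary condition: real firmware must also align
    each BAR to its own size and share the window with other devices.
    """
    if any(size > window_bytes for size in bar_sizes):
        return False
    return sum(bar_sizes) <= window_bytes

# Hypothetical numbers, for illustration only.
small_card = [16 * MIB, 256 * MIB, 32 * MIB]
large_bar_card = [16 * MIB, 16 * GIB, 32 * MIB]
window_below_4g = 2 * GIB  # assumed size of the 32-bit MMIO window

print(fits_in_window(small_card, window_below_4g))      # True
print(fits_in_window(large_bar_card, window_below_4g))  # False
```

A region larger than the whole window can never be placed there, which is why such cards depend on the firmware being able to map them above 4 GiB.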

You might have better luck with a specific ASUS workstation-class motherboard or a specific ASRock workstation-class motherboard, but I can’t really give you any guarantees or certifications.

Here’s an example of a thread on this board that has related information:

https://devtalk.nvidia.com/default/topic/790759/cannot-install-driver-for-nvidia-tesla-k40-cards-on-fedora-20/

I assume you are fully aware that what you are attempting to do is completely unsupported and you are proceeding entirely at your own risk. My take on the situation is that you are risking permanent damage to a very expensive device. I would strongly advise looking for a ready-made server solution that comes with a properly integrated K80 GPU. Note that GPUs including the K80 plug into PCIe connectors, which are different from PCI connectors.

The following comments are provided for entertainment value only. I have not actually used a K80. Your description hints at two unrelated problems.

The fact that the system won’t start with the K80 plugged in would seem to indicate either that the GPU is not correctly supplied with power (are all relevant PCIe power connectors plugged in?), that you are overloading the system’s power supply (maybe just on a particular rail), or that the GPU is overheating. Other explanations are possible, such as creating an electrical short somewhere.

The second issue seems to be a lack of system resources, possibly insufficient size of a memory aperture needed by the system to communicate with the GPU, or a conflict on an interrupt line, etc. This may be a question of your BIOS configuration, or your system infrastructure may be too old to provide the necessary resources. You would probably want to use a Haswell-based system or at least an Ivy Bridge-based system to try this with.

Damn, that is kind of interesting, and a little bit scary, too. Unfortunately there is not much public information about the Teslas available, so I did not even know about this >4G problem. So I understand your advice would be to search (at least) for a certified motherboard, right?
njuffa, could you give me a hint what you mean by “Note that GPUs including the K80 plug into PCIe connectors, which are different from PCI connectors.”? Do you refer to the power connectors? I noticed these were different from the old GPU connectors, but seem to be the same as modern CPU connectors.

I wasn’t referring to the PCIe power connectors (6-pin rated at 75W each, or 8-pin rated at 150W each). I was referring to the edge connectors carrying the signals between the system and the device. These are mechanically and electrically different between PCI and PCIe. Your post only referred to PCI, which seemed odd and, together with the other contents of your post, raised a red flag in my mind.

Ah, okay, sorry for being imprecise. Of course I meant PCIe. Thanks for your explanations.

Buyers of the Xeon Phi 31S1P (currently on sale by Intel) have the same kind of problems. Finding a matching motherboard is challenging; see for example “Will your motherboard work with Intel Xeon Phi?” and “Xeon Phi 31s1p Cooling and Motherboards”. One board known to work with the Xeon Phi is the Asus P9X79 WS. Also the Asus X99-A has the K40C listed in its Qualified Vendors List, so it might work with a K80 as well. As always YMMV, but hey, YOLO.

Most modern motherboards have an option called “Above 4G Decoding” (or something similarly named) in the BIOS, and it should be enabled for Tesla cards. It’s not enabled by default. It can be difficult to find; it’s usually buried in the advanced PCI settings.
We have an ASUS X99-S motherboard and managed to get a Tesla K80 running (it was not easy, especially the cooling).
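For anyone testing under Linux, one way to check whether the firmware really placed a device's regions above 4 GiB is to look at the device's `resource` file in sysfs, which lists one `start end flags` triple per region in hexadecimal. The sketch below parses that text; the sample contents (including the flag values) are made up for illustration, and the helper names are my own.

```python
FOUR_GIB = 1 << 32

def parse_resource_table(text):
    """Parse the text of a sysfs PCI 'resource' file into (start, end, flags)."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused entry
        regions.append((start, end, flags))
    return regions

def regions_above_4g(regions):
    """Return (start, size_in_bytes) for every region mapped at or above 4 GiB."""
    return [(start, end - start + 1)
            for start, end, _flags in regions
            if start >= FOUR_GIB]

# Made-up sample; real contents come from the sysfs file of your device.
sample = """\
0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x0000003800000000 0x0000003bffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for start, size in regions_above_4g(parse_resource_table(sample)):
    print(f"{start:#x}: {size // (1 << 30)} GiB")
```

On a real system the text would be read from `/sys/bus/pci/devices/<address>/resource`, where `<address>` is the device's PCI address as shown by `lspci -D`. If no region of the card shows up above 4 GiB, the BIOS option may not have taken effect.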

If anyone else has information about affordable motherboards with “large BAR” or “Above 4G Decoding” support, I am all ears. I own one actively cooled Intel Xeon Phi board with 6GB RAM but haven’t had the opportunity to get it running yet.

Christian

Hey txbob, I’m benefiting from your comments on this topic. I had to look up BAR. In this context it probably means Base Address Register, which makes sense. These GPU cards need a large contiguous block of memory pointed to by an address in a Base Address Register. My memory was jogged by looking up BAR here: http://www.geocities.jp/technoart_jp/Abbreviation/abbrev_e_COM.html

Yes, BAR is Base Address Register.

A Google search for “PCI BAR” will take you to the Wikipedia page (first hit), which adequately defines PCI BAR. Each BAR corresponds to a memory or I/O resource (i.e. range) that the device needs mapped into the system address space. The size of the BAR (i.e. the size of the region required to be mapped) is easily discoverable by the PCI BIOS via a simple write/read sequence, and when properly mapped, the BAR should contain (i.e. the PCI BIOS should set) the base address of the mapped region.
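The write/read sequence works roughly like this: the firmware writes all ones to the BAR, reads the value back, masks off the low type bits, and takes the two's complement of what remains. The sketch below applies that arithmetic to a readback value for a memory BAR. The function name and the example readback values are illustrative assumptions, not values taken from a K80.

```python
def memory_bar_size(readback_low, readback_high=None):
    """Size in bytes of a memory BAR, given the value read back after
    writing all ones to it. Pass readback_high for a 64-bit BAR, which
    occupies two consecutive 32-bit registers.
    """
    if readback_low & 0x1:
        raise ValueError("bit 0 is set: this is an I/O BAR, not a memory BAR")
    if readback_high is None:
        mask = 0xFFFFFFFF
        value = readback_low & 0xFFFFFFF0  # drop the four type/prefetch bits
    else:
        mask = 0xFFFFFFFFFFFFFFFF
        value = ((readback_high << 32) | readback_low) & 0xFFFFFFFFFFFFFFF0
    if value == 0:
        return 0  # BAR not implemented
    return (~value + 1) & mask  # two's complement gives the region size

# Illustrative readback values:
print(memory_bar_size(0xFF000000) // (1 << 20), "MiB")              # 16 MiB
print(memory_bar_size(0x0000000C, 0xFFFFFFFC) // (1 << 30), "GiB")  # 16 GiB
```

The second call shows why a 64-bit BAR matters here: a region of that size cannot be expressed, let alone placed, within a 32-bit address range.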

SurgeCell, could you give some indication of what you had to do to sufficiently cool your K80 and how you are monitoring the card temps? I installed one into a system with an ASUS P9X79-E-WS mobo and got it to solve an ANSYS test simulation, but I do not have a good way to monitor it yet. The first system that I tried is almost identical but has a P9X79-WS mobo, and it failed to boot after I enabled Above 4G Decoding. I still want to use this system and I am waiting to try a newer version of the BIOS after our supplier gives it to us.
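One possible way to watch the temperatures is to poll `nvidia-smi`, which ships with the NVIDIA driver. The sketch below assumes `nvidia-smi` is on the PATH and that the card reports `temperature.gpu`; the helper names, the sample output, and the 80 degree limit are arbitrary placeholders rather than recommended values.

```python
import subprocess

# Query index, name and core temperature as plain CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def parse_temperatures(csv_text):
    """Turn lines like '0, Tesla K80, 45' into (index, name, temp_c) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        readings.append((int(index), name.strip(), int(temp)))
    return readings

def read_temperatures():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)

def too_hot(readings, limit_c):
    """Return the readings at or above limit_c degrees Celsius."""
    return [reading for reading in readings if reading[2] >= limit_c]

# Made-up sample output; a K80 normally shows up as two GPUs.
sample = "0, Tesla K80, 45\n1, Tesla K80, 48\n"
print(too_hot(parse_temperatures(sample), limit_c=80))  # []
```

Calling `read_temperatures()` in a loop with a short `time.sleep` between calls, while the simulation runs, gives a simple log. Note that `int(temp)` will raise an error if the driver reports something other than a number for the temperature field.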

Hello @DBarnett and @SergeCell, can you please share the procedure and configuration you used to get a Tesla K80 working in a workstation? I am currently using a workstation-class ASRock Extreme 4 (X99) motherboard with a Xeon processor and 128 GB of DDR4 RAM. Thanks!