The K80 depends on a closed-loop cooling feedback path that uses a sideband communication method over PCIE to interact with the BMC on the server motherboard, which then controls the system fans accordingly. This requires special firmware on the server motherboard (as well as a proper ducted system-fan cooling arrangement). If you plug it into a "random" motherboard, that won't be present. Without it, you are left to provide your own cooling solution. The only suggestion I can offer here is to blow as much air as possible across the K80 while it is powered on. I'm not going to get into the nuances of fan choice, airflow (LFM), maximum inlet temperatures, ducting arrangements, or anything like that. Careful attention to all that detail is what you get when you buy an OEM-certified solution. I won't be able to help with that.
To be clear: if the fan you are using does not make it painful/annoying for you to be in the same room as the K80, you are probably not delivering enough airflow to support the necessary cooling under maximum load. The airflow should be directed so that it exits at the vented bracket. I don't remember for sure whether the K80 reports its temperature properly when it is not in a certified setting; I think it does. If it does (e.g. via nvidia-smi), then it would be wise to monitor that at least occasionally, to make sure temperatures are not out of control. (Especially monitor it under whatever your definition of "heavy load" is, or any time the card seems to be behaving "squirrely".)
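If the temperature reporting does work, a quick way to keep an eye on it is something like the following (a sketch; the query field names assume a reasonably recent driver's nvidia-smi, and you can check what your version supports with nvidia-smi --help-query-gpu):
nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw --format=csv -l 5
That prints the temperature and power draw of each GPU every 5 seconds; watch it while you apply whatever you consider heavy load.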
If the cooling is inadequate, the K80 may seem to power up OK and then give squirrely results at any of the following steps. If one of the following steps fails, and I can't deduce what is going on, then that will be the extent of my "help". Again, that is what you get in a certified solution: something that is tested and guaranteed to work. This is not that.
The K80 also requires that the PCIE aux power connection be populated with a proper configuration that supplies 300W+. Make sure your aux power is properly connected and capable of the load. I won't be recommending specific power supplies, or any other specifics. I wouldn't even bother trying this without a minimum 850W PSU, but you're free to experiment. I make no specific recommendations except that the power supply must be able to support the load, as defined in the board specification (below).
If in doubt about the location or configuration of the aux power connector, please refer to the published board specification:
http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf
If you still want to continue, after taking care of the above items:
- The first step is to make sure the system BIOS has properly mapped the BAR regions on each device. (The K80 consists of 2 GK210 GPU devices, both of which will show up separately in lspci.) The reason this atypical step is needed is that each GPU reports a 16GB BAR region to the system BIOS (amongst other resources) that needs to be mapped during PCI PnP configuration. Most PCIE adapters of any type do not make such large resource requests, and many system BIOSes will choke on this. Again, I make no specific recommendations about motherboard choice here. The only thing I will say is: use the latest BIOS for your motherboard.
For this test, run the following command as root and report back (if you wish):
lspci -vvv |grep -i -A 20 nvidia
We’re looking for a complete set of properly mapped BARs for each of the 2 devices, which will look something like this (for each device):
Good:
Region 0: Memory at f8000000 (32-bit, non-prefetchable)
Region 1: Memory at d8000000 (64-bit, prefetchable)
Region 3: Memory at d4000000 (64-bit, prefetchable)
Bad (note “unassigned”):
Region 0: Memory at c1000000 (32-bit, non-prefetchable)
Region 1: Memory at <unassigned> (64-bit, prefetchable)
Region 3: Memory at <unassigned> (64-bit, prefetchable)
Note this is not the exact output you will see, but we are specifically looking for 3 BAR regions and whether or not they were mapped (“assigned”) by the BIOS.
Also note that just doing
lspci
and observing that all GPUs appear to be “present” is not sufficient.
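If you want a quicker filter than the grep above, something like this should also work (10de is the NVIDIA PCI vendor ID; the exact filtering is just a convenience, not the only way to do it):
lspci -d 10de: -vvv | grep -i -E "region|unassigned"
Any Region line showing "unassigned" (or with no address at all) means the BIOS did not map that BAR.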
- The remaining steps, for a careful reader, are essentially covered in the Linux getting started guide:
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract
Possibly the most important step (and one that is sometimes overlooked) when using the runfile installer method is to remove the nouveau driver. This is covered in step 4.3.
The runfile install sequence would be:
- do a clean install of your Linux OS. Certainly you want to use a version that is supported by whatever CUDA version you are using. I spend 90% of my time on RHEL OSes (RHEL, CentOS, Fedora), so I am more familiar with those than e.g. Ubuntu. If you feel it's necessary to use e.g. Linux Mint or some other OS that isn't one of the officially listed supported OSes, that's fine, but I won't be able to help, nor would I bother at that point. And if you want to skip this step (starting with a clean OS install), I won't be able to help you. The runfile and package manager install methods are mutually exclusive: if you choose one, you can't switch to the other later without doing some careful cleanup first. Therefore system history matters. A clean OS install gives us a known history. I am not intending to help you unravel your system history if you skip this step. There are any number of historical actions your system may have undergone that will torpedo this process.
Also note that during the clean OS install, you'll want to select "SW development workstation" packages/groups, or whatever is appropriate for your OS, to get basic C compilers and development packages. The runfile installer method also requires kernel development sources (kernel headers). The OS install may be a convenient time to select these, although they can be located and installed later (a sketch of typical commands follows below). Do not modify the kernel in any way after installing (e.g. don't run system updates, or anything like that). You can certainly do those things, but they require special care and handling. My purpose is not to document every possibility here, but to map out what I consider a "best likelihood" sequence. After you prove that your setup is capable, you are free to modify to your heart's content.
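If you need to add the compilers and kernel headers after the fact, the usual commands are along these lines (package and group names can vary slightly between releases, so treat these as a sketch rather than gospel):
For RHEL/CentOS (as root):
yum groupinstall "Development Tools"
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
For Ubuntu (as root):
apt-get install build-essential linux-headers-$(uname -r)
The $(uname -r) part matters: the headers must match the running kernel, which is another reason not to update the kernel in the middle of this sequence.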
- (after doing a clean install, you can do the above described lspci test, if you wish)
- remove the nouveau driver. Again, it's covered in section 4.3. For Red Hat OSes, I'm confident in the following as necessary and sufficient (as root):
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
dracut --force
For Ubuntu, this should work (as root):
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
update-initramfs -u
- reboot (don't skip this!! it's essential so that the new initrd image you just created is actually in use.)
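A quick sanity check after the reboot (just a suggestion, not from the guide): as root, run
lsmod | grep nouveau
It should produce no output. If nouveau still shows up, the blacklist/initramfs step didn't take, and the runfile driver install will likely refuse to proceed.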
- download the appropriate CUDA runfile installer (http://www.nvidia.com/getcuda)
- as root, run the runfile installer
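The invocation is nothing exotic; it looks something like this, where the filename is a placeholder for whatever version you actually downloaded:
sh cuda_X.Y.ZZ_linux.run
The installer will typically ask whether to install the driver, the toolkit, and the samples; for this sequence you want at least the driver and the toolkit.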
That should be sufficient to get the GPUs working. Yes, there are other matters. Did you want to be able to compile the OpenGL Interop samples (6.3.1)? I haven’t covered that. It’s been covered elsewhere. If you ignore or skip the steps (6.1.1) about updating your PATH and LD_LIBRARY_PATH variables, then of course things may not work. If you skip the step about device node setup (4.4) then you may find that the GPUs are only accessible if you run a CUDA task as root first. I’m not trying to duplicate the entire getting started guide here; I assume you will read it and follow it.
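For reference, the environment settings in 6.1.1 boil down to something like the following (the exact directory depends on the CUDA version you installed and on whether the installer created the /usr/local/cuda symlink, so adjust as needed):
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Put those in your shell startup file (e.g. ~/.bashrc) if you want them to persist. After that, building and running the deviceQuery sample is the usual smoke test that both GK210 devices are visible.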
You'll also note that I'm not covering package manager installation. My suggestion would be to get the above sequence running first. After that, if you want to do a package manager installation, the deltas should be understood and are not specific to the K80. (You would probably want to start over with a clean OS install; otherwise, follow the steps in section 2.6 carefully.) If you want to do a package manager installation on Ubuntu and get messages about package dependencies or whatnot, that is not a K80 issue and I probably won't be able to help you with it. It's easily possible to screw up a package manager installation (especially if you don't follow the directions) and grab the wrong driver, or a CUDA toolkit that is mismatched with the driver, or otherwise end up with missing or mismatched pieces. In that case, maybe someone else can help, or maybe Google is your friend. The clean OS/runfile install essentially fireproofs you from all that. I'm not saying it's nirvana, but in that respect I find it simpler to get people up and running with a known-good config.
Troubleshooting the above sequence should start with these commands as root:
lspci -vvv |grep -i -A 20 nvidia
nvidia-smi
dmesg |grep NVRM
Good luck!