Driver Installing Problem for NVIDIA Tesla K80 under Linux

Hi there,

I have been trying to install the driver for the K80 under Ubuntu for a week without success, even though I have tried every method I could find on the NVIDIA forums and Google.

When I use lspci, the following shows up:
04:00.0 3D controller: NVIDIA Corporation Device 102d (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 102d (rev a1)

1. The motherboard I use is an ASRock Fatal1ty Z97X Killer, which supports PCI-e 3.0 x16. Would that be a problem?
http://www.asrock.com/mb/Intel/Fatal1ty%20Z97X%20Killer/

2. Every time I try to install the driver for the Tesla K80, the following messages show up:

a) The distribution-provided pre-install script failed

b) ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

I have tried installing on different versions of Ubuntu (12.04/14.04), but it still didn’t work.

Can someone please explain what the problem is and how I can fix it so I can finally install the drivers?

Thanks,
Jiawen

Did you follow the instructions in the linux getting started guide?

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract

That guide outlines 2 different paths and has step-by-step instructions for each. Don’t skip anything.

If you used the getting started guide, please identify at which step, using the guide number, you observed a problem.

Txbob, thank you for replying so fast!

Actually, I have reinstalled the driver more than 20 times (and reinstalled Linux (Ubuntu) more than 10 times, because I wanted to keep a faulty installation from influencing the next one) and followed every step of the instructions (I tried both the run and deb methods).

3: sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb or 4: sudo ./cuda__linux.run
or
Download the driver from http://www.nvidia.com/download/driverResults.aspx/80892/en-us and then
sudo ./NVIDIA-Linux-x86_64-340.65.run

Every time I finished one of the three above, the same error messages showed up:

-a)the distribution-provided pre-install script failed

-b)ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

So you mean the motherboard is not the problem?

I would appreciate it if you could give me any suggestions.

Thank you!

Start with a clean OS install.

Follow the runfile installation method. Be sure to perform step 4.3. If you have trouble, identify at which step you have trouble.

I appreciate your reply!

I completed step 4.3 every time. I also tried other methods to disable other drivers/profiling that might conflict. I reinstalled a clean OS more than 10 times to restart the steps, but the problem is still the same.

Is there any other possible solution?

Thank you so much!

I guess you don’t want to follow my instructions. Perhaps someone else can help you. Installing a K80 on an ASRock motherboard is probably not a good idea anyway. The K80 has specific cooling and motherboard support requirements. It’s quite possible that the motherboard won’t support it and/or you haven’t provided adequate cooling for the K80.

The K80 is designed to be used in OEM servers that are certified for the K80.

Hi txbob,

Your info is really helpful! Thank you!

Regards,
Jiawen

txbob,

What motherboard requirements are you referring to? Got a link? Would like to see if my current MB is sufficient.

Thanks,

John

No, there aren’t links to motherboard requirements. The K80 is not designed to be plugged into an arbitrary motherboard, in an arbitrary system you build. It may work of course, if you know what you’re doing, but it may not.

The K80 is intended to be used in qualified OEM systems. The link for that is here:

http://www.nvidia.com/object/where-to-buy-tesla.html

You’re welcome to do whatever you want with your K80 of course. But your mileage may vary.

If you want a more flexible Tesla GPU that is designed to be workable in a wider range of systems, including workstations, try the Tesla K40c.

Thanks txbob. I am currently using dual Titan X’s, but they suffer from lack of double precision. I saw a K80 listed on eBay for $2k and figured WTH, try it out. It shows up in my system using lspci, and I upgraded to the latest CUDA libraries (7.5), but deviceQuery still doesn’t see it. Shall I put it back on eBay, or is there something else I can try to get it working? I am very experienced with Linux, BIOS and other relevant areas, but I can’t seem to find a resource to tell me what specific things I should be checking to make the Tesla work.

The K80 depends on a closed-loop cooling feedback path that uses a sideband communication method on PCIE to interact with the BMC on the server motherboard, which would then control system fans accordingly. This requires special firmware on the server motherboard (as well as a proper ducted system fan cooling arrangement). If you plug it into a “random” motherboard, then that won’t be present. Without it you are left to provide your own cooling solution. The only suggestion I can offer here is to blow as much air as possible across the K80 while it is powered on. I’m not going to get into the nuances of fan choice, cooling LFM airflow, max inlet temperatures, ducting arrangements or anything like that. The careful attention to all that detail is what you get when you buy an OEM certified solution. I won’t be able to help with that.

To be clear: if the fan that you are using does not make it painful/annoying for you to be in the same room as the K80, you are probably not delivering enough airflow to support the necessary cooling under maximum load. The airflow should be directed such that it “exits” at the vented bracket. I don’t remember for sure whether the K80 reports its temperature properly when it is not in a certified setting; I think it does. If it does (e.g. via nvidia-smi) then it would be wise to monitor that occasionally at least, to make sure temps are not out of control. (Especially monitor it under whatever is your definition of “heavy load”, or any time the card seems to be behaving “squirrely”.)
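A minimal sketch of that occasional temperature check, assuming nvidia-smi does report temperatures on your setup. It parses the temperature.gpu query field and flags readings at or above a limit. The helper name check_temps is hypothetical, and the 80 C default for TEMP_LIMIT is a placeholder assumption, not a figure taken from the board specification.

```shell
#!/bin/sh
# Placeholder warning limit in degrees C (an assumption, not from the K80 board spec).
TEMP_LIMIT=${TEMP_LIMIT:-80}

# Hypothetical helper: reads "index, temperature" CSV lines on stdin
# and prints one status line per GPU.
check_temps() {
    while IFS=', ' read -r idx temp; do
        # Skip blank lines.
        [ -n "$idx$temp" ] || continue
        # Anything that is not a plain integer (e.g. an nvidia-smi error message) is echoed back.
        case "$temp" in
            ''|*[!0-9]*) echo "unparsed: $idx $temp"; continue ;;
        esac
        if [ "$temp" -ge "$TEMP_LIMIT" ]; then
            echo "GPU $idx: ${temp}C WARNING (limit ${TEMP_LIMIT}C)"
        else
            echo "GPU $idx: ${temp}C ok"
        fi
    done
}

# Usage (needs a loaded driver):
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | check_temps
```

Running the usage line every few seconds while the card is under load gives a rough picture of whether the airflow is keeping up.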

If the cooling is inadequate, the K80 may seem to power up OK and then give squirrely results at any of the following steps. If one of the following steps fails, and I can’t deduce what is going on, then that will be the extent of my “help”. Again, that is what you get in a certified solution. Something that is tested and guaranteed to work. This is not that.

The K80 also requires that the PCIE aux power connection be populated with a proper configuration that supplies 300W+. Make sure your aux power is properly connected and capable of the load. I won’t be recommending specific power supplies, or any other specifics. I wouldn’t even bother trying this without a minimum 850W PSU, but you’re free to experiment. I make no specific recommendations except that the power supply must be able to support the load, such as it is defined in the board specification (below).

If in doubt about the location or configuration of the aux power connector, please refer to the published board specification:

http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf

If you still want to continue, after taking care of the above items:

  1. The first step is to make sure the system bios has properly mapped the BAR regions on each device. (K80 consists of 2 GK210 GPU devices, both of which will show up separately in lspci). The reason this atypical step is needed is that each GPU reports a 16G BAR region to the system BIOS (amongst other resources) that need to be mapped during PCI PnP configuration. Most PCIE adapters of any type do not make such large resource requirements, and many system BIOSes will choke on this. Again, I make no specific recommendations about motherboard choice here. The only thing I will say is use the latest BIOS for your motherboard.

For this test, run the following command as root and report back(if you wish):

lspci -vvv |grep -i -A 20 nvidia

We’re looking for a complete set of properly mapped BARs for each of the 2 devices, which will look something like this (for each device):

Good:
Region 0: Memory at f8000000 (32-bit, non-prefetchable)
Region 1: Memory at d8000000 (64-bit, prefetchable)
Region 3: Memory at d4000000 (64-bit, prefetchable)

Bad (note “unassigned”):
Region 0: Memory at c1000000 (32-bit, non-prefetchable)
Region 1: Memory at <unassigned> (64-bit, prefetchable)
Region 3: Memory at <unassigned> (64-bit, prefetchable)

Note this is not the exact output you will see, but we are specifically looking for 3 BAR regions and whether or not they were mapped (“assigned”) by the BIOS.

Also note that just doing

lspci

and observing that all GPUs appear to be “present” is not sufficient.
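As a rough sketch of the same BAR check in scripted form: lspci prints <unassigned> in place of an address when a region was not mapped, so counting Region lines per device and flagging those markers gives a quick pass/fail. The helper name check_bars is hypothetical, and the parsing assumes the usual lspci -vv layout (device header at column 0, indented Region lines).

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -vv` style output on stdin and reports,
# per device, how many Region (BAR) lines exist and how many are unmapped.
# Exit status: 0 all mapped, 1 at least one unmapped, 2 no Region lines seen.
check_bars() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }      # device header, e.g. "04:00.0 3D controller: ..."
        /Region [0-9]+:/ {
            total[dev]++
            n++
            if ($0 ~ /<unassigned>|<ignored>/) bad[dev]++
        }
        END {
            if (n == 0) { print "no Region lines found"; exit 2 }
            status = 0
            for (d in total) {
                printf "%s: %d regions, %d unassigned\n", d, total[d], bad[d]
                if (bad[d] > 0) status = 1
            }
            exit status
        }
    '
}

# Usage (as root; 10de is the NVIDIA PCI vendor ID):
#   lspci -d 10de: -vv | check_bars
```

For a K80, the expectation from the description above is two devices, each with 3 regions and 0 unassigned.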

  2. The remaining steps, for a careful reader, are essentially covered in the linux getting started guide:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract

Possibly the most important step, sometimes overlooked, if using the runfile installer method, is to remove the nouveau driver. This is covered in step 4.3.

The runfile install sequence would be:

  • do a clean install of your Linux OS. Certainly you want to use a version that is supported by whatever CUDA version you are using. I spend 90% of my time on RHEL OS’s (RHEL, CentOS, Fedora) so I am more familiar with those than e.g. Ubuntu. If you feel it’s necessary to use e.g. Linux Mint or some other OS that isn’t one of the officially listed supported OS’s that’s fine, but I won’t be able to help nor would I bother at that point. And if you want to skip this step (start with an OS clean install) I won’t be able to help you. The runfile and package manager install methods are mutually exclusive. If you choose one, you can’t switch to the other later without doing some careful cleanup first. Therefore system history matters. A clean OS install gives us a known history. I am not intending to help you unravel your system history if you skip this step. There are any number of historical actions that your system may have undergone that will torpedo this process.

Also note that during the clean OS install, you’ll want to select “SW development workstation” packages/groups or whatever is appropriate for your OS to get you basic C compilers and development packages. The runfile installer method also requires kernel development sources (kernel headers). OS install may be a convenient time to select these, although they can be located and installed later. Do not modify the kernel after installing, in any way (e.g. don’t run system updates, or anything like that). You can certainly do these things, but they require special care and handling. My purpose is not to document every possibility here, but to map out what I consider a “best likelihood” sequence. After you prove that your setup is capable, then you are free to modify to your heart’s content.

  • (after doing a clean install, you can do the above described lspci test, if you wish)

  • remove the nouveau driver. Again, it’s covered in section 4.3. For redhat OS’s, I’m confident in the following as necessary and sufficient (as root):

echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
dracut --force

for ubuntu, this should work (as root):

echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
update-initramfs -u

  • reboot (don’t skip this!! it’s essential to get the new initrd image you just created to be in use.)
  • download the appropriate CUDA runfile installer (http://www.nvidia.com/getcuda)
  • as root, run the runfile installer
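Before launching the installer, two of the preconditions above can be sanity-checked after the reboot: that nouveau is really gone, and that kernel headers for the running kernel are present. This is only a sketch. The helper names are hypothetical, and the headers check assumes the common layout where /lib/modules/<release>/build points at the installed headers.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsmod` output on stdin and reports whether nouveau is loaded.
nouveau_status() {
    if grep -q '^nouveau[[:space:]]'; then
        echo "nouveau is loaded"
    else
        echo "nouveau is not loaded"
    fi
}

# Hypothetical helper: checks for a kernel build directory (headers).
#   $1: kernel release (default: the running kernel)
#   $2: modules root   (default: /lib/modules; assumed standard layout)
headers_status() {
    rel=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    if [ -d "$root/$rel/build" ]; then
        echo "kernel headers found for $rel"
    else
        echo "kernel headers missing for $rel"
    fi
}

# Usage after the reboot:
#   lsmod | nouveau_status
#   headers_status
```

If nouveau still shows as loaded, the blacklist file or the initrd rebuild did not take effect, and the nvidia.ko load error quoted earlier in the thread is the likely result.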

That should be sufficient to get the GPUs working. Yes, there are other matters. Did you want to be able to compile the OpenGL Interop samples (6.3.1)? I haven’t covered that. It’s been covered elsewhere. If you ignore or skip the steps (6.1.1) about updating your PATH and LD_LIBRARY_PATH variables, then of course things may not work. If you skip the step about device node setup (4.4) then you may find that the GPUs are only accessible if you run a CUDA task as root first. I’m not trying to duplicate the entire getting started guide here; I assume you will read it and follow it.
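For the PATH and LD_LIBRARY_PATH step, a sketch of what the environment setup typically looks like. The prefix /usr/local/cuda-7.5 is an assumption (the usual default for a CUDA 7.5 runfile install on 64-bit Linux), and CUDA_HOME is just a convenience variable introduced here; adjust both to match your install and follow the guide for the authoritative values.

```shell
#!/bin/sh
# Assumed install prefix; change it if you installed a different version or location.
CUDA_HOME=${CUDA_HOME:-/usr/local/cuda-7.5}

# Prepend the toolkit's bin and lib64 directories, keeping any existing entries.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Putting these lines in your shell profile makes them persistent; afterwards nvcc --version should resolve if the toolkit was installed under that prefix.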

You’ll also note that I’m not covering package manager installation. My suggestion would be to get the above sequence running first. After that, if you want to do package manager installation, the deltas should be understood and not specific to K80. (you would want to start over with a clean OS install, probably, otherwise follow the steps in section 2.6 carefully) If you want to do package manager installation on ubuntu and get some messages about package dependencies or whatnot, that is not a K80 issue and I probably won’t be able to help you with that. It’s easily possible to screw up package manager installation (especially if you don’t follow the directions) and grab the wrong driver, or a cuda toolkit that is mismatched with the driver, or otherwise have missing or mismatched pieces. In that case, maybe someone else can help or maybe google is your friend. The clean OS/runfile install essentially fireproofs you from all that. I’m not saying it’s nirvana, but in that respect, I find it simpler to get people up and running with a known good config.

Troubleshooting the above sequence should start with these commands as root:

lspci -vvv |grep -i -A 20 nvidia

nvidia-smi

dmesg |grep NVRM
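If you plan to report back, the three commands can be captured into one file. This is a convenience sketch only: collect_report and the default file name are hypothetical, and each tool is skipped with a note if it is not installed.

```shell
#!/bin/sh
# Hypothetical helper: writes the three troubleshooting outputs to one file.
#   $1: output file (default: nvidia-report.txt)
collect_report() {
    out=${1:-nvidia-report.txt}
    {
        echo "== lspci =="
        if command -v lspci >/dev/null 2>&1; then
            lspci -vvv 2>&1 | grep -i -A 20 nvidia || echo "(no NVIDIA devices listed)"
        else
            echo "(lspci not installed)"
        fi

        echo "== nvidia-smi =="
        if command -v nvidia-smi >/dev/null 2>&1; then
            nvidia-smi 2>&1 || echo "(nvidia-smi exited with an error)"
        else
            echo "(nvidia-smi not installed)"
        fi

        echo "== dmesg NVRM =="
        dmesg 2>&1 | grep NVRM || echo "(no NVRM messages)"
    } > "$out"
}

# Usage (as root):
#   collect_report /tmp/nvidia-report.txt
```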

Good luck!