Cannot install driver for NVIDIA Tesla K40 cards on Fedora 20

I recently got an NVIDIA Tesla K40 card and want to install it in my PC. However, the driver installation fails every time, so I'm looking for answers from the professionals here. My system runs Fedora 20, and the motherboard is an ASUS Z87-PLUS. This motherboard has integrated graphics, and since the Tesla K40 has no video output, I use the integrated graphics for display. The NVIDIA driver I am using is NVIDIA-Linux-x86_64-340.58.run.

Here is what I got from /var/log/nvidia-installer.log:
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[ 516.835201] [] warn_slowpath_fmt+0x5c/0x80
[ 516.835204] [] ? proc_alloc_inum+0x46/0xe0
[ 516.835205] [] proc_register+0xc0/0x140
[ 516.835206] [] proc_mkdir_data+0x52/0x80
[ 516.835208] [] proc_mkdir_mode+0x13/0x20
[ 516.835245] [] nv_register_procfs+0x4c/0x1d0 [nvidia]
[ 516.835275] [] nvidia_init_module+0x2a6/0x7d1 [nvidia]
[ 516.835297] [] ? nv_drm_init+0x15/0x15 [nvidia]
[ 516.835321] [] nvidia_frontend_init_module+0x86/0x81a [nvidia]
[ 516.835326] [] do_one_initcall+0xfa/0x1b0
[ 516.835328] [] ? set_memory_nx+0x43/0x50
[ 516.835331] [] load_module+0x1d92/0x25e0
[ 516.835333] [] ? store_uevent+0x70/0x70
[ 516.835336] [] ? kernel_read+0x50/0x80
[ 516.835338] [] SyS_finit_module+0xa6/0xd0
[ 516.835341] [] system_call_fastpath+0x16/0x1b
[ 516.835342] ---[ end trace 52008cc294abb559 ]---
[ 516.835366] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 16384M @ 0x0 (PCI:0000:01:00.0)
[ 516.835367] NVRM: The system BIOS may have misconfigured your GPU.
[ 516.835370] nvidia: probe of 0000:01:00.0 failed with error -1
[ 516.835746] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 516.835747] NVRM: None of the NVIDIA graphics adapters were initialized!
[ 516.835747] [drm] Module unloaded
[ 516.835806] NVRM: NVIDIA init module failed!

The result from ‘lspci | grep NVIDIA’ gives me this:
01:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)

So the card should be physically installed correctly. I searched for this problem for a while and found several possible causes.
First, some people say it may be caused by not blacklisting conflicting drivers, so I edited the file /etc/modprobe.d/blacklist.conf and added these lines to it:
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

Second, another possible cause is that the gcc version used to build the Linux kernel differs from the one used to build the NVIDIA driver module, so I checked it:
‘cat /proc/version’ gives me this:
Linux version 3.15.10-200.fc20.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 4.8.3 20140624 (Red Hat 4.8.3-1) (GCC) ) #1 SMP Thu Aug 14 15:39:24 UTC 2014
‘gcc -v’ gives me this:
gcc version 4.8.3 20140624 (Red Hat 4.8.3-1) (GCC)
So I think the compiler version should not be a problem.
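
The comparison above can also be scripted. This is only a rough sketch: `gcc_version_from` is a made-up helper name, and it assumes both /proc/version and `gcc -v` contain a "gcc version X.Y.Z" string, as they do in the output quoted above.

```shell
# Hypothetical helper: print the first "gcc version X.Y.Z" number found on stdin.
gcc_version_from() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Typical use on a live system (note that gcc -v writes to stderr):
#   kernel_gcc=$(gcc_version_from < /proc/version)
#   local_gcc=$(gcc -v 2>&1 | gcc_version_from)
#   if [ "$kernel_gcc" = "$local_gcc" ]; then
#       echo "gcc versions match: $kernel_gcc"
#   else
#       echo "gcc mismatch: kernel built with $kernel_gcc, installed gcc is $local_gcc"
#   fi
```

A match here only rules out the compiler mismatch cause; it says nothing about the other causes listed in the installer error.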

I tried adding ‘acpi=off’ to my grub kernel line (though I don’t know what this option does), and I also enabled ‘above 4G memory mapping’ in the BIOS. However, the problem persists after everything I’ve done.

So do you have any suggestions? I’m looking forward to your professional advice. Thank you in advance.

Do you have a K40c or K40m? Stated another way, does your K40 have a fan on it as part of the heatsink?

  1. Check for updates to the system BIOS for your motherboard. It may be that the system BIOS is not assigning resources correctly. K40 has a large (12GB) on-board memory that is mapped into PCI space via one of the BARs.

  2. blacklisting nouveau may not be enough. You may need to remove it from the initrd image as well. You can do this after blacklisting by rebooting and issuing:

dracut --force
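
To see whether nouveau is still around before and after regenerating the initrd, something like the following may help. It is a sketch only: `mentions_nouveau` is a made-up helper name, and it simply greps whatever text it is given, so a stray mention of "nouveau" in a file name (for example a blacklist file packed into the image) would also count as a hit.

```shell
# Hypothetical helper: report whether the text on stdin mentions nouveau.
mentions_nouveau() {
    if grep -q 'nouveau'; then
        echo "nouveau present"
    else
        echo "nouveau absent"
    fi
}

# Typical use on a live system:
#   lsmod    | mentions_nouveau    # is the module loaded right now?
#   lsinitrd | mentions_nouveau    # is it packed into the current initramfs?
```

If the module is reported as present in the initramfs even after blacklisting, that would be consistent with needing the dracut step above.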

Thanks txbob. From the lspci info I think my GPU is a K40m. Does it make a difference whether I use a K40m or a K40c? Anyway, I’ll check whether your approach works and report back later.

K40m vs. K40c should not matter from a driver perspective. However, K40m cannot cool itself, and should only be installed in a server that is designed to provide appropriate cooling for the K40. If you try to use a K40m in an ordinary desktop system (appears to be the case based on your motherboard) you’ll have disappointing results, as the K40m will overheat quickly. In fact, it’s remotely possible that the behavior you’re describing now is due to K40m overheating.

K40c, on the other hand, has its own fan and keeps itself cool.

Ah, I see, thank you for the patient answer. ^_^

I am having a similar problem with the same K40m device on Fedora 20. We installed the drivers using the instructions from the Fedora 20 subsection on this site: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#fedora-installation.

We have another, older GPU that we use for graphics (GF100, Quadro 4000), and the drivers installed successfully for that device, but the K40m is not detected by nvidia-smi. There is enough airflow in our server to prevent the K40m from overheating, but we are concerned about whether changes to the initrd image or other low-level system modifications are necessary to resolve our issue.

-Thanks.

Which driver is listed by nvidia-smi?
Did you use the runfile installer method or the package manager installer method (repo method)? (It seems you used the repo method based on the link you indicated.)
Do you have the proper power connections on the K40?
What is the output of:

lspci |grep -i nvidia

Also having this issue, but on Kubuntu.

nvidia-smi
Version 340.65, but only showing the Tesla C2070 and Quadro K6000 installed (there are 2 K40m’s installed as well).

lspci |grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Tesla C2050 / C2070] (rev a3)
02:00.1 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 6000] (rev a3)
03:00.1 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)
83:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
84:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)

This might be due to the system BIOS not assigning resources to the K40 devices correctly. K40, with 12G of memory, requires a 16G BAR to be assigned/allocated (amongst other resources…). With 4 GPUs in a single system, your system BIOS may be choking on plug-n-play setup. This can be deduced with careful analysis of lspci -vvv output, compared to a known good configuration (you will find resource allocation differences).

What sort of system do you have these 4 GPUs plugged into? Do you have the latest system BIOS version installed?

And as I’ve mentioned already, K40m is really only designed to be used in a system that has been properly qualified by an OEM to hold the K40.

Do you have the appropriate aux power connections made to the K40 devices?

I’ll get the output of lspci -vvv

They’re all sitting in a Nitro T5-CPU Supercomputer, which has two 1400 W power supplies and plenty of cooling; this thing is designed for quad-GPU configurations.

I’ll check the BIOS as well.

The Xenon Nitro T5 platform appears to be quite old - 5+ years old. I’m pretty sure it has not been qualified for K40m.

I suggest running K40m in a qualified platform.

Yep, just exploring every avenue before we go down that path as a new system would have to be purchased.

Here is the pastebin of the lspci -vvv output. I’m not really sure what I’m looking for, if there is a problem.
http://pastebin.com/jvEZncRw

Take a look at this section, starting at line 1545:

83:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
        Subsystem: NVIDIA Corporation Device 097e
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at c1000000 (32-bit, non-prefetchable) 
        Region 1: Memory at <unassigned> (64-bit, prefetchable) 
        Region 3: Memory at <unassigned> (64-bit, prefetchable)

See the lines that say <unassigned>? That is badness. It indicates a failure of the system BIOS to map those GPU resource regions into system-addressable space.

Compare it to the equivalent section for a GPU that is “working”:

(starting at line 902):

03:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 6000] (rev a3) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 076f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 114
        Region 0: Memory at f8000000 (32-bit, non-prefetchable) 
        Region 1: Memory at d8000000 (64-bit, prefetchable) 
        Region 3: Memory at d4000000 (64-bit, prefetchable) 
        Region 5: I/O ports at cc00 
        [virtual] Expansion ROM at f7f80000 [disabled] 

All is mapped, device is happy. Note that the Q6000 is not requesting a 16G mapped BAR. That is a significant difference between Q6000 and K40m in terms of general system compatibility, and one reason why it is recommended to use a qualified system.
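
For anyone who does not want to read through the full lspci -vvv dump by eye, the comparison above can be roughly automated. This is a sketch: `unassigned_bars` is a made-up helper name, and it assumes lspci prints device header lines starting in the first column and memory regions as "Region N: Memory at <unassigned>", as in the excerpts quoted above.

```shell
# Hypothetical helper: list every memory BAR that lspci reports as unassigned,
# prefixed with the bus address of the device it belongs to.
unassigned_bars() {
    awk '
        /^[0-9a-f]/ { dev = $1 }                       # device header line, e.g. "83:00.0 3D controller: ..."
        /Region [0-9]+: Memory at <unassigned>/ {
            sub(/^[ \t]+/, "")                         # drop the leading indentation
            print dev ": " $0
        }
    '
}

# Typical use on a live system:
#   sudo lspci -vvv | unassigned_bars
```

No output means lspci did not report any unassigned memory region. It does not prove the mapping is otherwise sane.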

I think that Nitro T5 has a Supermicro motherboard. A system BIOS update may resolve this issue, but I wouldn’t bet my life on it.

(Your message text mentions Quadro K6000 but that is actually a Quadro 6000 - one generation older than K6000)

Now that I think about it some more, this may well be indicated already in the system message log.

dmesg |grep NVRM

may be instructive.
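
If the log is long, the BAR complaint can be picked out directly. This is a sketch that assumes the driver words the message the way it appears in the installer log at the top of this thread ("BAR1 is 16384M @ 0x0 (PCI:0000:01:00.0)"); other driver versions may phrase it differently. `nvrm_bar_errors` is a made-up helper name.

```shell
# Hypothetical helper: extract "BARn is <size>M @ <address> (PCI:...)" fragments
# from kernel log text on stdin.
nvrm_bar_errors() {
    grep -o 'BAR[0-9]* is [0-9]*M @ 0x[0-9a-fA-F]* (PCI:[0-9a-fA-F:.]*)'
}

# Typical use on a live system:
#   dmesg | grep NVRM | nvrm_bar_errors
```

A size of 16384M is the 16G BAR discussed earlier, and the NVRM message itself flags a region reported this way as invalid.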

And to continue my rant about a qualified system. Cooling a passive GPU in a qualified system usually involves a closed-loop control system involving the server BMC that is monitoring GPU temp and adjusting system fans accordingly. The mechanism for determining GPU temperatures has changed over the course of Tesla GPU history, so a system qualified to keep an early Tesla GPU cool may not even know how to read the temperature of a modern Tesla GPU. That means after you solve the system bios issue, you may still have to carefully consider cooling airflow, and perhaps use a “maximum fans” setting if it is offered by your system configuration, in order to have a chance to keep things cool under load. Even this provides no guarantees. If you do get it working, I would spend some time monitoring the K40 temps under heavy load, before you declare victory.
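
For the temperature monitoring mentioned above, a small filter over nvidia-smi's CSV query output is one option. This is a sketch: `hot_gpus` is a made-up helper name, and the default limit of 80 is an arbitrary placeholder, not an NVIDIA-specified threshold. It assumes the installed driver's nvidia-smi supports `--query-gpu`.

```shell
# Hypothetical helper: read "index, temperature" lines on stdin and print the
# GPUs whose temperature is at or above the limit given as $1 (default 80).
hot_gpus() {
    limit=${1:-80}    # placeholder value, pick one suited to your card
    awk -F', *' -v limit="$limit" '$2 + 0 >= limit { print "GPU " $1 " at " $2 " C" }'
}

# Typical use on a live system, polling every 5 seconds while a load runs:
#   while sleep 5; do
#       nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 80
#   done
```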

Hi txbob,

May I ask what command you used to get those two outputs?

I think I figured it out: lspci -vv

The output was actually provided by cptwin. It is from lspci -vvv. Refer to entries #10 and #12 in this thread. All I did was pick out specific sections of the output that was already posted by cptwin.

I thought I’d comment on this thread since I just managed to get my K40 working on a non-validated motherboard. As long as you can enable “above 4G memory mapping” in your BIOS, I found that acpi=off did work, but it crippled the rest of the system. After reading through the kernel options documentation for a while, I found that the kernel option pci=nocrs,noearly was enough to get our BARs registered correctly.
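
To confirm the option actually reached the running kernel after a reboot, the kernel command line can be checked. This is a sketch: `has_pci_option` is a made-up helper name. The grubby line in the comments is one way to add the option on Fedora; it is an assumption about your setup, and other distributions such as Kubuntu configure the boot loader differently.

```shell
# Hypothetical helper: succeed if the kernel command line on stdin contains
# the given sub-option inside a pci=... argument.
has_pci_option() {
    tr ' ' '\n' | grep '^pci=' | tr ',' '\n' | sed 's/^pci=//' | grep -qx "$1"
}

# Adding the option persistently on Fedora (assumes grubby manages your boot entries):
#   sudo grubby --update-kernel=ALL --args="pci=nocrs,noearly"
#
# After rebooting, check that it took effect:
#   has_pci_option nocrs   < /proc/cmdline && echo "pci=nocrs is active"
#   has_pci_option noearly < /proc/cmdline && echo "pci=noearly is active"
```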