Problem with CUDA 7 toolkit on CentOS 6.6

We purchased a Tesla K80 and installed it in a server running CentOS 6.6.

lspci | grep -i nvidia
44:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K80] (rev a1)
45:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K80] (rev a1)

uname -m && cat /etc/*release
x86_64
CentOS release 6.6 (Final)

I installed the driver using the wizard from here:

http://www.nvidia.com/Download/Find.aspx?lang=en-us

This installed driver version 346.59:

modinfo /lib/modules/2.6.32-504.16.2.el6.x86_64/kernel/drivers/video/nvidia.ko
filename: /lib/modules/2.6.32-504.16.2.el6.x86_64/kernel/drivers/video/nvidia.ko
alias: char-major-195-*
version: 346.59
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00svsdbc04sc80i00*
alias: pci:v000010DEd00000AA3svsdbc0Bsc40i00*
alias: pci:v000010DEdsvsdbc03sc02i00
alias: pci:v000010DEdsvsdbc03sc00i00
depends: i2c-core
vermagic: 2.6.32-504.16.2.el6.x86_64 SMP mod_unload modversions
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp

I then installed the cuda toolkit using instructions from:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#axzz3ca8v4mMB

I chose the CentOS package release as recommended. This installed:

[root@genuse32 yum.repos.d]# rpm -qa | grep cuda
cuda-cusparse-7-0-7.0-28.x86_64
cuda-samples-7-0-7.0-28.x86_64
cuda-driver-dev-7-0-7.0-28.x86_64
cuda-npp-dev-7-0-7.0-28.x86_64
cuda-cufft-dev-7-0-7.0-28.x86_64
cuda-documentation-7-0-7.0-28.x86_64
cuda-7.0-28.x86_64
cuda-misc-headers-7-0-7.0-28.x86_64
cuda-curand-7-0-7.0-28.x86_64
cuda-cudart-7-0-7.0-28.x86_64
cuda-toolkit-7-0-7.0-28.x86_64
cuda-repo-rhel6-7-0-local-7.0-28.x86_64
cuda-cusolver-dev-7-0-7.0-28.x86_64
cuda-cublas-dev-7-0-7.0-28.x86_64
cuda-runtime-7-0-7.0-28.x86_64
cuda-license-7-0-7.0-28.x86_64
cuda-npp-7-0-7.0-28.x86_64
cuda-cufft-7-0-7.0-28.x86_64
cuda-visual-tools-7-0-7.0-28.x86_64
cuda-7-0-7.0-28.x86_64
cuda-cusparse-dev-7-0-7.0-28.x86_64
cuda-nvrtc-dev-7-0-7.0-28.x86_64
cuda-command-line-tools-7-0-7.0-28.x86_64
cuda-cusolver-7-0-7.0-28.x86_64
cuda-cublas-7-0-7.0-28.x86_64
cuda-drivers-346.46-0.x86_64
cuda-core-7-0-7.0-28.x86_64
cuda-curand-dev-7-0-7.0-28.x86_64
cuda-cudart-dev-7-0-7.0-28.x86_64
cuda-nvrtc-7-0-7.0-28.x86_64

I compiled all the samples successfully as well. However, running deviceQuery resulted in:

/usr/local/cuda/samples/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Running nvidia-smi results in:

nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system

Searching Google, the only thing of real value I have been able to find is an Ubuntu thread from 2013 indicating that the installed kernel module was too new for the installed toolkit. It did not have a solution. Any help would be greatly appreciated!

It was probably unwise to install the driver using a runfile installer and then switch to the package manager method for other components. It’s possible that as you pulled in those other components, they pulled in driver components:

cuda-drivers-346.46-0.x86_64

that are incompatible with the 346.59 driver you installed.
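A quick way to check for this kind of mismatch is to compare the version reported by the kernel module with the version in the cuda-drivers package name. This is only a rough sketch: the function names are made up for illustration, and it assumes the module is named nvidia and the package follows the cuda-drivers-<version>-<release> pattern shown in the rpm listing above.

```shell
#!/usr/bin/env bash
# Sketch: compare the nvidia kernel module version with the cuda-drivers RPM version.
# All function names here are hypothetical helpers, not part of any NVIDIA tool.

# Print the "version:" field from modinfo-style output read on stdin.
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Print the version from a cuda-drivers-<version>-<release>.<arch> line read on stdin.
package_driver_version() {
    sed -n 's/^cuda-drivers-\([0-9][0-9.]*\)-.*$/\1/p' | head -n 1
}

# Report whether the two version strings agree.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: module $1, package $2"
    fi
}

# Only query the live system when both tools are available.
if command -v modinfo >/dev/null 2>&1 && command -v rpm >/dev/null 2>&1; then
    mod=$(modinfo nvidia 2>/dev/null | module_version) || true
    pkg=$(rpm -qa 2>/dev/null | package_driver_version) || true
    if [ -z "$mod" ] || [ -z "$pkg" ]; then
        echo "could not determine both versions"
    else
        compare_versions "$mod" "$pkg"
    fi
fi
```

With the values posted above (module 346.59, package 346.46) this would report a mismatch.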

If you’re going to use a runfile installer, I’d suggest starting over, and just using the cuda toolkit runfile installer. It will install a suitable driver along with the cuda toolkit.

You may need to reload the OS first, or do a good job of purging old nvidia components.
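Before purging, it helps to see which install methods left traces on the machine. The sketch below only reports what it finds and removes nothing. It assumes the runfile driver left its uninstaller at /usr/bin/nvidia-uninstall, which is the path NVIDIA's installation guide gives, so verify that on your own system. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report traces of runfile and RPM installs of NVIDIA components.
# It prints suggestions only; nothing is uninstalled.

# Succeed if a runfile driver uninstaller exists at the given (or default) path.
runfile_driver_present() {
    [ -x "${1:-/usr/bin/nvidia-uninstall}" ]
}

# Filter "rpm -qa"-style input on stdin down to CUDA and NVIDIA related packages.
nvidia_packages() {
    grep -Ei '^cuda|nvidia' || true
}

if runfile_driver_present; then
    echo "runfile driver found; to remove it: sudo /usr/bin/nvidia-uninstall"
fi

if command -v rpm >/dev/null 2>&1; then
    pkgs=$(rpm -qa 2>/dev/null | nvidia_packages) || true
    if [ -n "$pkgs" ]; then
        echo "RPM packages found; to remove them: sudo yum remove <package_name>"
        echo "$pkgs"
    fi
fi
```

If both branches print something, the system has a mix of the two methods, which is the situation described here.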

Recap: either use only the package manager method, or use only the runfile installer method.

Mixing the two can be troublesome. This is referenced in the doc you indicated:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#handle-uninstallation

Thanks txbob. I didn’t use the runfile installer; I installed the GPU driver from the site. I thought I had to do this so the card was addressable from the OS, and I did not realize the CUDA toolkit installed a driver of its own. This was my mistake. The bizarre thing, as I worked through it yesterday, was that deleting nvidia.ko from the kernel’s /lib/* and rebooting didn’t do the trick. The card was still unaddressable. I accidentally resolved it by running an upgrade on the machine, and got lucky that there just happened to be a new kernel available. After rebooting into the new kernel everything just magically worked, and it only had the .46 driver that I needed. Thanks!
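For anyone hitting the same thing: the modinfo output earlier shows the module sitting under one specific kernel's tree in /lib/modules, so it can be useful to list which kernel trees contain an nvidia.ko and compare that with the running kernel. A rough sketch follows; list_nvidia_modules is a made-up helper name, and this does not by itself explain why deleting the file was not enough.

```shell
#!/usr/bin/env bash
# Sketch: list every installed kernel tree that contains an nvidia.ko file.

# Print "<kernel-version> <path>" for each nvidia.ko below the given modules root.
list_nvidia_modules() {
    local root="${1:-/lib/modules}" path rel
    find "$root" -name 'nvidia.ko' -type f 2>/dev/null | sort | while IFS= read -r path; do
        rel="${path#"$root"/}"        # strip the modules root prefix
        echo "${rel%%/*} $path"       # first path component is the kernel version
    done
}

echo "running kernel: $(uname -r)"
list_nvidia_modules
```

If the running kernel is not in the list, or more than one nvidia.ko shows up for it, that is worth looking at before reinstalling anything.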

The GPU driver also has a runfile installer. When you access the driver download site that you linked, the only things available there (for Linux) are runfile installers. So if you installed the driver from that site, you used a runfile installer “method”.

The package manager method is accomplished without using that driver download site, and instead uses package commands such as yum or apt-get appropriate for whatever linux distro you have.

And apart from all that, the CUDA toolkit comes in both runfile installer formats and package manager methods/formats.

If you use a CUDA toolkit runfile installer, buried inside that CUDA toolkit runfile installer is a runfile installer for the driver (that happens to be bundled with the CUDA toolkit).

So I believe a clash occurred between the runfile method you used to install the GPU driver and the package manager method you used to install the toolkit (which also brought driver components with it).