I wonder if some experts on CUDA installation (who know what happens during the process, step by step) could help. I have to install CUDA on a cluster of GPU-equipped nodes which boot not from local hard disks but over the network from a central server, using an image file prepared from a chrooted environment on the server. The network booting is done by Warewulf and works perfectly. Now I have to find a way to install the CUDA drivers and other CUDA software into this image so that everything works once the nodes have booted from it. Unfortunately, it is not as simple as running the downloaded nVidia installers…
I think I have made significant progress on this; what’s more, I would expect it to work by now, but unfortunately it does not. At present, running deviceQuery gives the error message “CUDA driver version is insufficient for CUDA runtime version”. I wonder if you can help me find out what I did wrong.
Let me sum up what I’ve done:
- Kernel driver install:
Since there is no CUDA-capable GPU in the server (and I can’t insert one, since there’s no 16x PCIe slot), even compiling the kernel driver needed some tricks (the standard install mode fails, since it aborts when the driver cannot be loaded). I downloaded “devdriver_4.0_linux_64_270.41.19.run” from nVidia, extracted its contents with the “-x” switch, went to the “kernel” directory inside, and issued the command “make module; make module-install”. This built the “nvidia.ko” kernel module and put it in the appropriate place under “/lib/modules/…/video/”, from where I copied it into the image file of the nodes. The kernel version on the nodes is exactly the same as on the server.
I think everything is fine up to this point, since the driver loads when the nodes boot and dmesg shows no errors.
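The copy into the image described above can be sketched roughly as follows. This is only an illustration: the function name `copy_module_into_image` is made up for this post, and it simply places the module at the same absolute path inside the image as it has on the server.

```shell
# Hypothetical helper: copy a kernel module into the node image (chroot),
# keeping the same absolute path it has on the server.
copy_module_into_image() {
    module_path="$1"   # absolute path of nvidia.ko on the server
    image_root="$2"    # root directory of the chrooted node image

    [ -f "$module_path" ] || { echo "no such module: $module_path" >&2; return 1; }
    [ -d "$image_root" ] || { echo "no such image root: $image_root" >&2; return 1; }

    # Recreate the module's directory inside the image, then copy the file,
    # preserving mode and timestamps.
    dest_dir="$image_root$(dirname "$module_path")"
    mkdir -p "$dest_dir" && cp -p "$module_path" "$dest_dir/"
}
```

It would be called with the full path of the freshly built nvidia.ko as the first argument and the root directory of the chroot as the second.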
- CUDA libraries
The directory extracted from “devdriver_4.0_linux_64_270.41.19.run” in the previous step contained the already compiled *.so.270.41.19 files. Judging from the contents of some nvidia lib rpm packages prepackaged by volunteers, these seem to be simply copied into /usr/lib64. This is what I did.
I’m not absolutely sure that I’m correct at this point. Can someone confirm it?
- CUDA toolkit install
I downloaded and ran the “cudatoolkit_4.0.17_linux_64_rhel6.0.run” package. As the destination directory I chose my home directory rather than /usr/local, because the home directory is visible on (NFS-mounted by) all nodes. The first two steps were done as root; from this point on I did everything as a regular user, but I don’t think that should matter as long as I’m the only user.
I think this step is OK too; the installation finished without errors. Afterwards I set my “LD_LIBRARY_PATH” to the respective “/home…/lib64” and “/home/…/lib” directories, as asked by the installer.
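For reference, the environment setup amounts to something like the lines below. This is a sketch assuming the toolkit sits in the `cuda` directory under my home directory; `CUDA_HOME` is just a convenience variable name chosen here, and the `PATH` line is an addition that the library path itself does not need.

```shell
# Toolkit location under the home directory (assumed: $HOME/cuda);
# CUDA_HOME is only a local convenience variable.
CUDA_HOME="$HOME/cuda"

# Prepend the toolkit libraries, keeping any entries that are already set.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$CUDA_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Make nvcc and the other toolkit binaries available as well.
export PATH="$CUDA_HOME/bin:$PATH"
```

Putting these lines into the shell startup file in the shared home directory means the nodes pick them up at login too.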
- CUDA SDK install
I downloaded and ran the “gpucomputingsdk_4.0.17_linux.run” package as a normal user and installed it into my home directory, which is its default. After installing some dependencies that turned out to be needed by the installer, the process was successful and all sample codes compiled. Again, since the home directory is shared, it will be visible to the nodes.
- Ready. Try.
I guessed that was all I had to do, so I checked whether it works. (Of course, it doesn’t :-( )
I booted the node and checked the kernel module; it is loaded:
[pusztai@gpu01 ~]$ lsmod |grep nvidia
nvidia 10713027 0
i2c_core 31274 1 nvidia
Since no X is running on the nodes, I used the short script from the nVidia PDF guide to create the device files. I checked; they’re there, with the correct permissions:
[pusztai@gpu01 ~]$ ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 16 00:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Dec 16 00:05 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 16 00:05 /dev/nvidiactl
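The device files listed above correspond to the following `mknod` calls (character major 195, one minor per GPU, minor 255 for the control device). The sketch below only prints the commands so they can be reviewed before being run as root; the function name `print_nvidia_mknod` is made up for this post, and this is not the exact script from the guide.

```shell
# Hypothetical helper: print the mknod commands for N GPUs instead of running them.
print_nvidia_mknod() {
    gpu_count="$1"
    i=0
    while [ "$i" -lt "$gpu_count" ]; do
        # One character device per GPU: major 195, minor = GPU index.
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    # Control device shared by all GPUs: major 195, minor 255.
    echo "mknod -m 666 /dev/nvidiactl c 195 255"
}

# Two GPUs, as on the node shown above.
print_nvidia_mknod 2
```

Piping the output to `sh` as root would then create the nodes.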
I also checked the kernel version, etc.:
[pusztai@gpu01 ~]$ dmesg |tail -8
nvidia 0000:03:00.0: PCI INT A disabled
nvidia 0000:04:00.0: PCI INT A disabled
nvidia 0000:03:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
nvidia 0000:03:00.0: setting latency timer to 64
vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=none,decodes=none:owns=io+mem
nvidia 0000:04:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
nvidia 0000:04:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011
[pusztai@gpu01 ~]$ cat /proc/version
Linux version 2.6.32-131.21.1.el6.x86_64 (firstname.lastname@example.org) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Tue Nov 22 19:48:09 GMT 2011
[pusztai@gpu01 ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011
GCC version: gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC)
I checked that the nvidia libraries are in /usr/lib64/:
[pusztai@gpu01 ~]$ ls -al /usr/lib64/*.270.41.19
-rwxr-xr-x 1 root root 1008272 Dec 15 18:59 /usr/lib64/libGL.so.270.41.19
-rwxr-xr-x 1 root root 155544 Dec 15 18:59 /usr/lib64/libXvMCNVIDIA.so.270.41.19
-rwxr-xr-x 1 root root 9259326 Dec 15 18:59 /usr/lib64/libcuda.so.270.41.19
-rwxr-xr-x 1 root root 6327720 Dec 15 18:59 /usr/lib64/libglx.so.270.41.19
-rwxr-xr-x 1 root root 2042224 Dec 15 18:59 /usr/lib64/libnvcuvid.so.270.41.19
-rwxr-xr-x 1 root root 133064 Dec 15 18:59 /usr/lib64/libnvidia-cfg.so.270.41.19
-rwxr-xr-x 1 root root 20498976 Dec 15 18:59 /usr/lib64/libnvidia-compiler.so.270.41.19
-rwxr-xr-x 1 root root 27484752 Dec 15 18:59 /usr/lib64/libnvidia-glcore.so.270.41.19
-rwxr-xr-x 1 root root 85464 Dec 15 18:59 /usr/lib64/libnvidia-ml.so.270.41.19
-rwxr-xr-x 1 root root 6008 Dec 15 18:59 /usr/lib64/libnvidia-tls.so.270.41.19
-r-xr-xr-x 1 root root 295416 Dec 15 18:59 /usr/lib64/libnvidia-wfb.so.270.41.19
-rwxr-xr-x 1 root root 4064 Dec 15 18:59 /usr/lib64/libvdpau.so.270.41.19
-rw-r--r-- 1 root root 1656744 Dec 15 18:59 /usr/lib64/libvdpau_nvidia.so.270.41.19
-rwxr-xr-x 1 root root 46872 Dec 15 18:59 /usr/lib64/libvdpau_trace.so.270.41.19
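One thing the listing above does not show is whether the usual short names (for example `libcuda.so.1`) exist next to the fully versioned files. I am not sure whether they are required here, but as far as I understand programs normally look libraries up by such a short name, and the prepackaged rpms may create them. A small sketch to check this; the function name `missing_lib_links` is hypothetical, and the assumption that the short name is always `<name>.so.1` is mine and may not hold for every file in the list:

```shell
# Hypothetical helper: for every <name>.so.<version> file in a directory,
# report when <name>.so.1 is absent (assumed short name, not verified per library).
missing_lib_links() {
    lib_dir="$1"
    version="$2"
    for f in "$lib_dir"/*.so."$version"; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=${f%".so.$version"}         # strip the ".so.<version>" suffix
        [ -e "$base.so.1" ] || echo "missing: $base.so.1"
    done
}
```

On the node this would be run as `missing_lib_links /usr/lib64 270.41.19`; any line it prints names a short library name that is not present.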
I checked whether my LD_LIBRARY_PATH is correct:
[pusztai@gpu01 ~]$ echo $LD_LIBRARY_PATH
/home/pusztai/cuda/lib64/:/home/pusztai/cuda/lib
So it seems that everything is OK. But running deviceQuery fails:
[pusztai@gpu01 ~]$ /home/pusztai/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
[deviceQuery] starting...
/home/pusztai/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
[deviceQuery] test results...
FAILED
Press ENTER to exit...
Thanks for your patience in reading this extra-long post; I just wanted to provide all the details of what I did.
I’m stuck here. I wonder if any of you have some ideas or hints on how to proceed.