Install Problem

Thanks avidday. Here’s what happens:

$ nvidia-smi -lsa

==============NVSMI LOG==============

Timestamp : Wed Dec 9 09:09:08 2009

GPU 0:

    Product Name            : Tesla C1060

    PCI ID                  : 5e710de

    Temperature             : 34 C

GPU 1:

    Product Name            : Tesla C1060

    PCI ID                  : 5e710de

    Temperature             : 33 C

GPU 2:

    Product Name            : Tesla C1060

    PCI ID                  : 5e710de

    Temperature             : 34 C

GPU 3:

    Product Name            : Tesla C1060

    PCI ID                  : 5e710de

    Temperature             : 34 C

[mrosing@bouredhat release]$ ./deviceQuery

CUDA Device Query (Runtime API) version (CUDART static linking)

There is no device supporting CUDA

Test PASSED

Press ENTER to exit…

the cards are really there, so that’s a good start!

I have a bunch of stateless x86_64 CentOS 5.1 cluster nodes with Teslas, the current 190-series release driver, and CUDA 2.3 (but with the Fedora Core 10 kernel). They look just like your system and they just work, so I don’t know what to suggest. I remember the release notes mentioned some problems with lmem command line settings on Red Hat 5 based systems, but I also seem to recall that was specific to IA-32, not x86_64.

OK, thanks. I’ll see if I can contact someone directly at NVIDIA. This is a new setup, so it may just be that I need to include some other driver that is obvious but not automatic.

I put in a request for help to cuda@nvidia.com, so we’ll see if I get a reply. In the meantime I ran ./deviceQueryDrv and got this:

CUDA Device Query (Driver API) statically linked version
There are 4 devices supporting CUDA

Device 0: “Tesla C1060”
CUDA Driver Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

[Devices 1, 2, and 3 report the same specifications as Device 0.]

Test PASSED

Press ENTER to exit…

So the driver works, but deviceQuery does not. The first lines of code in deviceQueryDrv are

    CUresult err = cuInit(0);
    CU_SAFE_CALL_NO_SYNC(cuDeviceGetCount(&deviceCount));

but in deviceQuery the equivalent call is

    cudaGetDeviceCount(&deviceCount);

Does anybody know what cuInit does?

I’m definitely getting myself more confused. I copied the lines from deviceQueryDrv.cpp into deviceQuery.cpp just to see what would happen, and the build fails with a link error:

deviceQuery.cpp:35: warning: unused variable 'err'
obj/release/deviceQuery.cpp.o: In function `main':
deviceQuery.cpp:(.text+0x38): undefined reference to `cuInit'
deviceQuery.cpp:(.text+0x45): undefined reference to `cuDeviceGetCount'
collect2: ld returned 1 exit status

The exact same includes in deviceQueryDrv result in a successful build. cuInit is declared in cuda.h, and both deviceQueryDrv and deviceQuery #include cuda.h, so I don’t understand why the linker can’t find it. Anybody have some ideas?
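Poking around with nm shows where the symbol actually lives; this is just what I tried on my box (library paths assumed from the default driver and toolkit install locations):

```shell
# cuda.h only *declares* cuInit; the definition lives in the driver library,
# which the NVIDIA driver installer puts under /usr/lib64 on x86_64:
nm -D /usr/lib64/libcuda.so | grep cuInit
# should print a line containing "T cuInit" (symbol defined here)

# the runtime library installed by the toolkit does not export the driver API:
nm -D /usr/local/cuda/lib64/libcudart.so | grep cuInit
# should print nothing
```

So compiling against cuda.h succeeds either way; it’s the link step that needs the right library.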

For the driver API version (deviceQueryDrv), you need to link against libcuda.so instead of the CUDA runtime library libcudart.so (which deviceQuery uses).

You probably need to link with -L/usr/lib64 -lcuda

N.

I might as well just document what I’m doing, eventually somebody will point out what I’m doing wrong…

I figured out that the build can link against either the driver version or the runtime version of the CUDA libraries. By specifying

USEDRVAPI := 1

I got the cuInit(0) call to link - but it broke other things. Looking at the header files, I note that some declarations are marked CUDARTAPI (runtime API) and others CUDAAPI (driver API). Which library actually gets linked is chosen by the lines

[codebox]# static linking, we will statically link against CUDA and CUDART

ifeq ($(USEDRVAPI),1)

 LIB += -lcuda   ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) $(CUDPPLIB) ${LIB} 

else

 LIB += -lcudart ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) $(CUDPPLIB) ${LIB}

endif

endif[/codebox]

in NVIDIA_GPU_Computing_SDK/C/common/common.mk.

Now, libcudart.so is in /usr/local/cuda/lib64, but I don’t see libcuda.so there. Yet both lines obviously link, because both programs build; only the driver library routines actually work. It may be I have no choice in how to build running applications - I’ll just use -lcuda and not -lcudart, as long as both have equivalent abilities. Anybody with any ideas on what is going on here?

edit: thanks for the pointer to -lcuda. Looks like that’s what I need to do for all programs.
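For reference, a minimal sketch of the two link lines as I understand them (paths assumed: the driver installs libcuda.so under /usr/lib64, while the toolkit puts libcudart.so under /usr/local/cuda/lib64):

```shell
# driver API program: calls cuInit, cuDeviceGetCount, ... -> link libcuda
g++ deviceQueryDrv.cpp -I/usr/local/cuda/include -L/usr/lib64 -lcuda -o deviceQueryDrv

# runtime API program: calls cudaGetDeviceCount, ... -> link libcudart
g++ deviceQuery.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -o deviceQuery
```

Both use the same include directory; only the library linked differs.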

Yeesh, I don’t know how much faster I can move backwards. I got a notice there were updates for RedHat, so I let the system install them and did a reboot. When I do

modprobe nvidia

I get

FATAL: Module nvidia not found.

I assume I have to re-run the nvidia driver installer and reboot again. But why would installing updates kill the original driver load?

After every kernel update, you need to reinstall the NVIDIA driver after rebooting so that the kernel module can be rebuilt against the new kernel. AFAIK, other updates do not require this.

N.

The key phrase I missed there was “rebuilt against”. It’s not just loading the driver; the module has to be built against the running kernel so the symbols match.

Thanks - I’ll be doing it a lot I suspect.
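For my own notes, the sequence to repeat after each kernel update looks roughly like this (the installer file name is whatever driver release you downloaded, and X should not be running while the installer works):

```shell
# after rebooting into the new kernel, rebuild the module against it:
sh NVIDIA-Linux-x86_64-190.18-pkg2.run
# then load the freshly built module:
modprobe nvidia
```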

I tried rebuilding the driver to the new kernel. Here is some of the nvidia-installer.log:

-> Using the kernel source path '/usr/src/kernels/2.6.18-164.6.1.el5-x86_64' as
   specified by the '--kernel-source-path' commandline option.
-> Kernel source path: '/usr/src/kernels/2.6.18-164.6.1.el5-x86_64'
-> Kernel output path: '/usr/src/kernels/2.6.18-164.6.1.el5-x86_64'
-> Performing rivafb check.
-> Performing nvidiafb check.
-> Performing Xen check.
-> Cleaning kernel module build directory.
   executing: 'cd ./usr/src/nv; make clean'...
-> Building kernel module:
[...]
   ld -m elf_x86_64 -r -o /tmp/selfgz19235/NVIDIA-Linux-x86_64-190.18-pkg2/usr/src/nv/nvidia.ko /tmp/selfgz19235/NVIDIA-Linux-x86_64-190.18-pkg2/usr/src/nv/nvidia.o /tmp/selfgz19235/NVIDIA-Linux-x86_64-190.18-pkg2/usr/src/nv/nvidia.mod.o
   NVIDIA: left KBUILD.
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most
   frequently when this kernel module was built against the wrong or
   improperly configured kernel sources, with a version of gcc that differs
   from the one used to build the target kernel, or if a driver such as
   rivafb/nvidiafb is present and prevents the NVIDIA kernel module from
   obtaining ownership of the NVIDIA graphics device(s), or the NVIDIA GPU
   installed in this system is not supported by this NVIDIA Linux graphics
   driver release.

How do I go through all these possible problems and figure out which one has to be fixed? It looks like it passed the rivafb and nvidiafb checks, and I’d think gcc is OK, but I’m really not sure what needs to be configured in the kernel. I don’t need the GPUs for graphics; they are all intended for computation. How do I tell the driver that?

I called Red Hat and they told me to roll back to the previous kernel. So I went to look at the boot configuration, and grub.conf had been modified by the updater to boot the xen kernel by default! So the error message was right - the driver was built against a different kernel than the one running. I changed grub.conf back to booting the standard kernel, rebooted, and now the nvidia driver loads and all the device files are created (I put the script in rc.local - it seems to work fine).
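For anyone who hits the same thing, the relevant part of grub.conf is just the default entry (the numbering here is from my system; entries are counted from 0 in the order their title sections appear):

```shell
# /boot/grub/grub.conf (fragment)
# the updater had pointed "default" at the xen kernel's entry;
# setting it back to the standard kernel's entry fixed the module load
default=0
```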

So now I’m back to: -lcuda works and -lcudart fails. Since libcuda is the lowest level and it runs, I can at least start building my own code. But it would be nice to know why the runtime library fails.

To end this story - what I downloaded from the “get cuda” web page was an inconsistent mix of driver, toolkit, and SDK versions. Getting consistent versions of everything fixed the problem - everything now works. The 3.0 beta versions can be found here:

http://forums.nvidia.com/index.php?showtopic=149959