We have recently upgraded from a Tesla S1070 to a Fermi S2050 system
while at the same time upgrading to SLES11. The CPU nodes are unchanged.
We have successfully installed CUDA 3.2 and the driver offered by the
download site (260.24).
Under the previous regime (SLES10 and CUDA 2.3) everything worked fine.
Currently, however, even though nvidia-smi recognizes the 2 Fermi devices
on each GPU node and the permissions on /dev/nvidia* are 666, neither the
PGI ‘pgaccelinfo’ command nor codes built with the CUDA 3.2 toolkit stack
that simply request device information can find a device on the GPU nodes.
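For anyone wanting to reproduce the failure outside of the PGI tools, the
deviceQuery sample that ships with the GPU Computing SDK is a quick check;
on a healthy node it should list both Fermi devices. (Paths below assume a
default SDK install location; adjust for your site.)

    # build and run the CUDA SDK device-information sample
    cd ~/NVIDIA_GPU_Computing_SDK/C/src/deviceQuery
    make
    ../../bin/linux/release/deviceQuery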
We solved our own problem here … or at least found a workaround.
We backed up and reinstalled only the kernel module outside the
provided installer, and then ran the installer with the
--no-kernel-module option, which the installer describes as:

    Install everything but the kernel module, and do not remove
    any existing, possibly conflicting kernel modules. This
    can be useful in some DEBUG environments. If you use this
    option, you must be careful to ensure that a NVIDIA kernel
    module matching this driver version is installed
    separately.
The run completed without error. Apparently, installing the kernel
module from inside the installer on our SGI SLES11sp1 images was
corrupting the required supporting libraries. These go in without
problem when they are installed without trying to put in the kernel
module at the same time.
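A quick sanity check for this sort of mismatch is to compare the version
reported by the loaded kernel module against the user-space CUDA driver
library it must match (the library path shown is the usual 64-bit default
and may differ on your system):

    # version of the kernel module actually loaded
    cat /proc/driver/nvidia/version
    # version of the user-space driver library the toolkit links against
    ls -l /usr/lib64/libcuda.so*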
This is a two-step process: first add the required 3.2 kernel module
with modprobe, then run the installer with the --no-kernel-module
option above. This would seem to be a problem peculiar to our SGI
SLES11sp1 installation, but others may find it useful.
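For reference, the two steps boil down to something like the following
(the module must already be built and installed for the running kernel,
and the exact installer filename depends on the package you downloaded):

    # step 1: load the 260-series kernel module by hand
    modprobe nvidia
    # step 2: install the user-space pieces only, skipping the
    # installer's own kernel module handling
    sh NVIDIA-Linux-x86_64-260.24.run --no-kernel-module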