We have recently upgraded from a Tesla S1070 to a Fermi S2050 system
while at the same time upgrading to SLES11. The CPU nodes are unchanged.
We have successfully installed CUDA 3.2 and the driver offered by the
download site (260.24).
Under the previous regime (SLES10 and CUDA 2.3) everything worked fine.
Currently, however, even though nvidia-smi recognizes the 2 Fermi devices
on each GPU node and the permissions on /dev/nvidia* are 666, neither the
PGI ‘pgaccelinfo’ command nor codes built with the CUDA 3.2 toolkit stack
that simply request device information can find a device on the GPU nodes.
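For anyone wanting to reproduce the failure outside of the PGI tools, the
deviceQuery sample that ships with the GPU Computing SDK is a quick check;
on a healthy node it should list both Fermi devices. (Paths below assume a
default SDK install location; adjust for your site.)

    # build and run the CUDA SDK device-information sample
    cd ~/NVIDIA_GPU_Computing_SDK/C/src/deviceQuery
    make
    ../../bin/linux/release/deviceQuery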
We solved our own problem here … or at least found a workaround.
We backed up and reinstalled only the kernel module outside the
provided installer, and then ran the installer with the
--no-kernel-module option, which the installer describes as:

    Install everything but the kernel module, and do not remove
    any existing, possibly conflicting kernel modules. This
    can be useful in some DEBUG environments. If you use this
    option, you must be careful to ensure that a NVIDIA kernel
    module matching this driver version is installed
    separately.
The run completed without error. Apparently, installing the kernel
module from inside the installer on our SGI SLES11sp1 images was
corrupting the required supporting libraries. These go in without
problem when they are installed without trying to put in the kernel
module at the same time.
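A quick sanity check for this sort of mismatch is to compare the version
reported by the loaded kernel module against the user-space CUDA driver
library it must match (the library path shown is the usual 64-bit default
and may differ on your system):

    # version of the kernel module actually loaded
    cat /proc/driver/nvidia/version
    # version of the user-space driver library the toolkit links against
    ls -l /usr/lib64/libcuda.so*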
This is a two-step process: first add the required 3.2 kernel module
with modprobe, then run the installer with the --no-kernel-module
option above. This would seem to be a problem peculiar to our SGI
SLES11sp1 installation, but others may find it useful.
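For reference, the two steps boil down to something like the following
(the module must already be built and installed for the running kernel,
and the exact installer filename depends on the package you downloaded):

    # step 1: load the 260-series kernel module by hand
    modprobe nvidia
    # step 2: install the user-space pieces only, skipping the
    # installer's own kernel module handling
    sh NVIDIA-Linux-x86_64-260.24.run --no-kernel-module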