CUDA 5 on CentOS 6.5: deviceQuery fails and cudaGetDeviceCount() returns wrong values

I have just reinstalled my GPU workstation (4x Tesla C1060) with CentOS 6.5 and then installed CUDA 5 on it, but CUDA does not work.

Things I have done:
(1) install: in text mode, logged in as root, ran “sh cuda_5.0.35_linux_64_rhel6.x-1.run”; all three parts (driver, toolkit and samples) were installed correctly. The file was downloaded from https://developer.nvidia.com/cuda-toolkit-archive
(2) test deviceQuery: logged in as root, ran “make” under “/usr/local/cuda-5.0/samples/1_Utilities/deviceQuery”, then ran “./deviceQuery” and got the output below (a sketch for decoding the return value follows this list):
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
→ invalid device ordinal

(3) test a program (source in the PS below): it compiled without problems, but when I ran it I got:
device(s) detected on this machine: 0
1.000000*1.000000=61082585891995785255441237761970929664.000000
2.000000*2.000000=0.000000
3.000000*3.000000=0.000000
4.000000*4.000000=0.000000
5.000000*5.000000=288207915396127595249952358400.000000
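
For reference, the 10 printed by deviceQuery is just the numeric value of the cudaError_t returned by cudaGetDeviceCount(); on this CUDA version its error string is the “invalid device ordinal” shown above. A minimal sketch for decoding it (my own illustration, not part of the deviceQuery sample):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int ndevice = 0;
    /* cudaGetDeviceCount() returns a cudaError_t; deviceQuery printed 10,
       whose error string here is "invalid device ordinal". */
    cudaError_t err = cudaGetDeviceCount(&ndevice);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("device(s) detected on this machine: %d\n", ndevice);
    return 0;
}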

I really have no idea what to do with this :(
Could anyone give me some advice?
Thanks~

PS: program

#include <cuda_runtime.h>
#include <stdio.h>
#define N 5
__global__ void kernel(float *a, float *b)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid<N)
        b[tid] = a[tid]*a[tid];
}

int main ()
{
    int i;
    
    int ndevice=0;
    cudaGetDeviceCount(&ndevice);
    printf("device(s) detected on this machine: %d\n", ndevice);
     
    float A[N], B[N], *a, *b;
    cudaMalloc(&a, sizeof(float)*N);
    cudaMalloc(&b, sizeof(float)*N);
    for(i=0;i<N;i++) A[i]=i+1;
    cudaMemcpy(a, A, sizeof(float)*N, cudaMemcpyHostToDevice);
    kernel<<<1,N>>>(a,b);
    cudaMemcpy(B, b, sizeof(float)*N, cudaMemcpyDeviceToHost);
    for(i=0;i<N;i++) printf("%f*%f=%f\n", A[i], A[i], B[i]);    
    return 0;
}
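
For completeness, here is the same program with error checks added (just a sketch; the check() helper is my own, not a CUDA API). If the driver is broken, every runtime call fails, the kernel never runs, and B[] is left as uninitialized stack memory, which would explain the garbage numbers above.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#define N 5

/* check() is my own helper, not part of the CUDA runtime:
   print a readable message and exit whenever a runtime call fails. */
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

__global__ void kernel(float *a, float *b)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N)
        b[tid] = a[tid]*a[tid];
}

int main()
{
    int i, ndevice = 0;
    check(cudaGetDeviceCount(&ndevice), "cudaGetDeviceCount");
    printf("device(s) detected on this machine: %d\n", ndevice);

    float A[N], B[N], *a, *b;
    check(cudaMalloc(&a, sizeof(float)*N), "cudaMalloc(a)");
    check(cudaMalloc(&b, sizeof(float)*N), "cudaMalloc(b)");
    for (i = 0; i < N; i++) A[i] = i+1;
    check(cudaMemcpy(a, A, sizeof(float)*N, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
    kernel<<<1,N>>>(a, b);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(B, b, sizeof(float)*N, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    for (i = 0; i < N; i++) printf("%f*%f=%f\n", A[i], A[i], B[i]);
    cudaFree(a);
    cudaFree(b);
    return 0;
}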

still unsolved…

It’s possible that you have not properly disabled the nouveau driver on your system. Do a Google search on how to blacklist nouveau and do that. Then remove nouveau from the initrd image with the following:

dracut -f /boot/initramfs-$(rpm -qa kernel --queryformat "%{PROVIDEVERSION}.%{ARCH}\n" | tail -1).img $(rpm -qa kernel --queryformat "%{PROVIDEVERSION}.%{ARCH}\n" | tail -1)

Then reboot your machine and try again.

Thanks for your reply. I followed your advice but it’s still the same.
First, I blacklisted nouveau by adding the line “blacklist nouveau” to /etc/modprobe.d/blacklist.conf.
Then I removed nouveau with the command you gave (it returned nothing in the terminal) and rebooted the machine.
But deviceQuery still returns the same thing I posted at the beginning.

What is the result when you run:

nvidia-smi -a

?

[root@GPU-A ~]# nvidia-smi -a
NVIDIA: could not open the device file /dev/nvidia1 (Input/output error).
NVIDIA-SMI has failed because it couldn’t communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

There is a problem with the driver install.

What is the result of running:

dmesg |grep NVRM

(and)

dmesg |grep nouv

[root@GPU-A ~]# dmesg |grep NVRM
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
NVRM: GPU at 0000:08:00.0 has fallen off the bus.
NVRM: RmInitAdapter failed! (0x12:0x2b:1893)
NVRM: rm_init_adapter(1) failed
NVRM: GPU at 0000:08:00.0 has fallen off the bus.
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1183)
NVRM: rm_init_adapter(1) failed

[root@GPU-A ~]# dmesg |grep nouv
nothing was returned~

The “GPU has fallen off the bus” message very often indicates some kind of hardware instability: perhaps a loose connection, a power-delivery issue, an overheating issue, or some other system-level instability.

I would suggest starting by reducing the system to a single C1060 and seeing whether you can get that running stably. If so, add the cards back one at a time and watch for any power-related, thermal-related, or electrical-connection issues.

Thanks! I’ll try that.
Could it be that there are hardware problems with the GPU devices themselves?

Yes, certainly. If you add the GPUs one at a time, you may discover one that appears to be unstable.

Another problem, but maybe a clue~
The machine always boots back up immediately after it is shut down, whether it is shut down with “poweroff” or by a long press on the power button. The only way to shut it down completely is to cut the power!
I had noticed this problem before I reinstalled the OS, but I didn’t pay attention to it.
Could this be caused by the same issue that is causing the GPU problem? What is the issue most likely to be?