CUDA 5 on CentOS 6.5: deviceQuery fails and cudaGetDeviceCount() returns wrong values

gongxufei · February 23, 2014, 8:52am

I have just reinstalled my GPU workstation (4 × Tesla C1060) with CentOS 6.5 and then installed CUDA 5 on it, but CUDA does not work.

Things I have done:
(1) Install: in text mode, logged in as root, I ran “sh cuda_5.0.35_linux_64_rhel6.x-1.run”. All three parts (driver, toolkit and samples) installed correctly. The file was downloaded from https://developer.nvidia.com/cuda-toolkit-archive
(2) Test deviceQuery: logged in as root, I ran “make” in “/usr/local/cuda-5.0/samples/1_Utilities/deviceQuery”, then ran “./deviceQuery” and got:
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal

(3) Test a program: it compiled without problems, but when I ran it I got:
device(s) detected on this machine: 0
1.000000*1.000000=61082585891995785255441237761970929664.000000
2.000000*2.000000=0.000000
3.000000*3.000000=0.000000
4.000000*4.000000=0.000000
5.000000*5.000000=288207915396127595249952358400.000000

I really have no idea what to do with this :(
Could anyone give me some advice?
Thanks~

PS: program

#include <cuda_runtime.h>
#include <stdio.h>
#define N 5
__global__ void kernel(float *a, float *b)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid<N)
        b [tid] = a[tid]*a[tid];
}

int main ()
{
    int i;
    
    int ndevice=0;
    cudaGetDeviceCount(&ndevice);
    printf("device(s) detected on this machine: %d\n", ndevice);
     
    float A[N], B[N], *a, *b;
    cudaMalloc(&a, sizeof(float)*N);
    cudaMalloc(&b, sizeof(float)*N);
    for(i=0;i<N;i++) A[i]=i+1;
    cudaMemcpy(a, A, sizeof(float)*N, cudaMemcpyHostToDevice);
    kernel<<<1,N>>>(a,b);
    cudaMemcpy(B, b, sizeof(float)*N, cudaMemcpyDeviceToHost);
    for(i=0;i<N;i++) printf("%f*%f=%f\n", A[i], A[i], B[i]);    
    return 0;
}

gongxufei · February 24, 2014, 2:52am

still unsolved…

Robert_Crovella · February 24, 2014, 5:34am

It’s possible that you have not properly disabled the nouveau driver on your system. Do a Google search on how to blacklist nouveau, and do that. Then remove nouveau from the initrd image with the following:

dracut -f /boot/initramfs-$(rpm -qa kernel --queryformat "%{PROVIDEVERSION}.%{ARCH}\n" | tail -1).img $(rpm -qa kernel --queryformat "%{PROVIDEVERSION}.%{ARCH}\n" | tail -1)

Then reboot your machine and try again.

gongxufei · February 24, 2014, 1:16pm

Thanks for your reply. I followed your advice but it’s still the same.
First, I blacklisted nouveau by adding the line “blacklist nouveau” to /etc/modprobe.d/blacklist.conf
Then I removed nouveau from the initrd image with the command you gave (it printed nothing in the terminal) and rebooted the machine.
But deviceQuery still returns the same thing as I posted in the beginning.
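
For anyone repeating these steps, here is a minimal sketch of how the blacklist entry could be double-checked. The function name nouveau_blacklisted is hypothetical (it is not part of any NVIDIA or CentOS tool), and the sketch assumes the entry lives in a *.conf file under /etc/modprobe.d, as in the post above.

```shell
# Hypothetical helper: succeeds if any *.conf file in the given directory
# (default: /etc/modprobe.d) contains an uncommented "blacklist nouveau" line.
nouveau_blacklisted() {
    dir="${1:-/etc/modprobe.d}"
    grep -Eqs '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$dir"/*.conf
}

# Usage (run as root on the affected machine):
#   nouveau_blacklisted && echo "blacklist entry found"
#   lsmod | grep -i nouveau || echo "nouveau module not loaded"
```

If the helper succeeds and lsmod shows no nouveau module after the reboot, nouveau is probably not what is blocking the NVIDIA driver.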

Robert_Crovella · February 24, 2014, 3:24pm

What is the result when you run:

nvidia-smi -a

gongxufei · February 24, 2014, 6:19pm

[root@GPU-A ~]# nvidia-smi -a
NVIDIA: could not open the device file /dev/nvidia1 (Input/output error).
NVIDIA-SMI has failed because it couldn’t communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

Robert_Crovella · February 24, 2014, 7:13pm

There is a problem with the driver install.

What is the result of running:

dmesg |grep NVRM

and

dmesg |grep nouv

gongxufei · February 24, 2014, 10:10pm

[root@GPU-A ~]# dmesg |grep NVRM
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
NVRM: GPU at 0000:08:00.0 has fallen off the bus.
NVRM: RmInitAdapter failed! (0x12:0x2b:1893)
NVRM: rm_init_adapter(1) failed
NVRM: GPU at 0000:08:00.0 has fallen off the bus.
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1183)
NVRM: rm_init_adapter(1) failed

[root@GPU-A ~]# dmesg |grep nouv
(nothing was returned)

Robert_Crovella · February 24, 2014, 11:07pm

The “GPU has fallen off the bus” message very often indicates some kind of hardware instability. Perhaps it is a loose connection, a power delivery issue, overheating, or some other system-level instability.

I would suggest starting by reducing the system to a single C1060. See if you can get that running stably. If so, proceed to add more cards, and see if you can discover any power-related, thermal, or electrical connection issues.
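
While adding cards back one at a time, it may help to tally which PCI address the driver complains about. The following is a minimal sketch only; the function name nvrm_bus_drops is hypothetical, and it assumes the NVRM message has the same wording as in the dmesg output posted above.

```shell
# Hypothetical helper: read kernel log text on stdin and print, for each PCI
# address, how many "has fallen off the bus" messages NVRM logged for it.
nvrm_bus_drops() {
    sed -n 's/.*NVRM: GPU at \([0-9a-fA-F:.]*\) has fallen off the bus.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   dmesg | nvrm_bus_drops
```

Fed the dmesg output posted earlier, this would print "0000:08:00.0 2", which points at a single device. Running "lspci -s 08:00.0" should then show which card sits at that address.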

gongxufei · February 24, 2014, 11:29pm

Thanks! I’ll try that.
Is it possible that the GPUs themselves have hardware problems?

Robert_Crovella · February 24, 2014, 11:47pm

Yes, certainly. If you add the GPUs one at a time, you may discover one that appears to be unstable.

gongxufei · February 25, 2014, 3:49am

Another problem, but maybe a clue:
The machine always boots again immediately after it is shut down, whether it is shut down with “poweroff” or by a long press on the power button. The only way to shut it down completely is to cut off the power!
I had noticed this problem before I reinstalled the OS, but I didn’t pay attention to it.
Could this be caused by the same issue that causes the GPU problem? What is the issue most likely to be?
