deviceQuery and deviceQueryDrv pass, but other CUDA programs fail

Recently I installed the CUDA 5.5 package on our lab’s server, which has a Tesla C2050 and runs CentOS release 5.9. I can compile every sample and every simple CUDA program I wrote on it, but only deviceQuery and deviceQueryDrv run correctly. The GPU gets stuck and gives no response when I run other programs such as BandwidthTest. After I run any other program, even deviceQuery itself gives me nothing until I restart the machine.

Here is the output of deviceQuery:

    Detected 1 CUDA Capable device(s)

    Device 0: "Tesla C2050"
    CUDA Driver Version / Runtime Version 5.5 / 5.5
    CUDA Capability Major/Minor version number: 2.0
    Total amount of global memory: 2687 MBytes (2817720320 bytes)
    (14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
    GPU Clock rate: 1147 MHz (1.15 GHz)
    Memory Clock rate: 1500 Mhz
    Memory Bus Width: 384-bit
    L2 Cache Size: 786432 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 32768
    Warp size: 32
    Maximum number of threads per multiprocessor: 1536
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 2 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device PCI Bus ID / PCI location ID: 6 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Tesla C2050
    Result = PASS

And here is the output of deviceQueryDrv:

    ./deviceQueryDrv Starting...

    CUDA Device Query (Driver API) statically linked version
    Detected 1 CUDA Capable device(s)

    Device 0: "Tesla C2050"
    CUDA Driver Version: 5.5
    CUDA Capability Major/Minor version number: 2.0
    Total amount of global memory: 2687 MBytes (2817720320 bytes)
    (14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
    GPU Clock rate: 1147 MHz (1.15 GHz)
    Memory Clock rate: 1500 Mhz
    Memory Bus Width: 384-bit
    L2 Cache Size: 786432 bytes
    Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65535) 3D=(2048, 2048, 2048)
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 32768
    Warp size: 32
    Maximum number of threads per multiprocessor: 1536
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
    Texture alignment: 512 bytes
    Maximum memory pitch: 2147483647 bytes
    Concurrent copy and kernel execution: Yes with 2 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Concurrent kernel execution: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device PCI Bus ID / PCI location ID: 6 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    Result = PASS

Could anyone shed some light on this matter?

Hi, I have the same problem as you. Have you found a solution?

No, I didn’t. I installed CUDA 4.0 instead, and it works fine for me.

What does the “nvidia-smi” command report?
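
It may also help to narrow down which CUDA call actually hangs, since deviceQuery only queries properties and never allocates memory, copies data, or launches a kernel. Below is a minimal sketch of such a check, not a definitive diagnostic. The names CHECK, step and addOne are hypothetical helpers made up for this example; the cuda* calls are standard runtime API functions. Each step is announced on stderr before it runs, so the last line printed should point at the call that never returns.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: report and abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "  %s failed (line %d): %s\n", #call,     \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: announce a step before it runs (stderr is unbuffered).
static void step(const char *name) {
    fprintf(stderr, "step: %s\n", name);
}

// Trivial kernel: add one to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = NULL;

    step("cudaSetDevice");
    CHECK(cudaSetDevice(0));

    step("cudaMalloc");
    CHECK(cudaMalloc((void **)&dev, n * sizeof(int)));

    step("cudaMemcpy host -> device");
    CHECK(cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice));

    step("kernel launch");
    addOne<<<(n + 127) / 128, 128>>>(dev, n);
    CHECK(cudaGetLastError());        // catches launch configuration errors

    step("cudaDeviceSynchronize");
    CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel ran

    step("cudaMemcpy device -> host");
    CHECK(cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost));

    step("cudaFree");
    CHECK(cudaFree(dev));

    // Verify the result on the host.
    for (int i = 0; i < n; ++i) {
        if (host[i] != i + 1) {
            fprintf(stderr, "wrong value at index %d\n", i);
            return EXIT_FAILURE;
        }
    }
    fprintf(stderr, "all steps completed\n");
    return EXIT_SUCCESS;
}
```

Assuming the file is saved as check.cu (a made-up name), it can be built for the C2050's compute capability 2.0 with `nvcc -arch=sm_20 check.cu -o check`. If the program stalls at cudaMalloc or the first cudaMemcpy, the problem is likely in the driver or system setup rather than in any particular sample.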