SDK examples stopped working

I have previously been able to get the CUDA SDK examples working, but I moved my hardware into a new machine. I believe I have configured both correctly:

1. Fedora 15

2. Tesla C2070

3. Tried both 4.2 and 5.0 drivers/toolkit

4. Driver installs (I change to runlevel 3 and blacklist nouveau with the following commands):

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img

dracut /boot/initramfs-$(uname -r).img $(uname -r)

5. Running make works fine for the SDK

When I run an example, I get:

$ C/bin/linux/release/vectorAdd

[vectorAdd] starting...

Vector Addition

vectorAdd.cu(127) : CUDA Runtime API error 4: unspecified launch failure.

If I run it under cuda-gdb, the debugger freezes until I break out with C-c. The backtrace shows it is stuck in cudaMemcpy:

$ cuda-gdb ../../bin/linux/debug/vectorAdd

NVIDIA (R) CUDA Debugger

5.0 release

Portions Copyright (C) 2007-2012 NVIDIA Corporation

GNU gdb (GDB) 7.2

Copyright (C) 2010 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.  Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-unknown-linux-gnu".

For bug reporting instructions, please see:

<http://www.gnu.org/software/gdb/bugs/>...

Reading symbols from /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd...done.

(cuda-gdb) r

Starting program: /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd

[Thread debugging using libthread_db enabled]

[vectorAdd] starting...

Vector Addition

[New Thread 0x7ffff705c700 (LWP 409)]

[Context Create of context 0x622630 on Device 0]

[Launch of CUDA Kernel 0 (VecAdd<<<(196,1,1),(256,1,1)>>>) on Device 0]

^C

Program received signal SIGINT, Interrupt.

0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0

(cuda-gdb) bt

#0  0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0

#1  0x00007ffff7252a17 in ?? () from /usr/lib64/libcuda.so

#2  0x00007ffff732b9c0 in ?? () from /usr/lib64/libcuda.so

#3  0x00007ffff732bf7a in ?? () from /usr/lib64/libcuda.so

#4  0x00007ffff732c1b2 in ?? () from /usr/lib64/libcuda.so

#5  0x00007ffff7318268 in ?? () from /usr/lib64/libcuda.so

#6  0x00007ffff7319169 in ?? () from /usr/lib64/libcuda.so

#7  0x00007ffff730f279 in ?? () from /usr/lib64/libcuda.so

#8  0x00007ffff7245787 in ?? () from /usr/lib64/libcuda.so

#9  0x00007ffff724977d in ?? () from /usr/lib64/libcuda.so

#10 0x00007ffff7231406 in ?? () from /usr/lib64/libcuda.so

#11 0x00007ffff7daaf5a in ?? () from /usr/local/cuda/lib64/libcudart.so.5.0

#12 0x00007ffff7dd798f in cudaMemcpy ()

   from /usr/local/cuda/lib64/libcudart.so.5.0

#13 0x00000000004011dd in main (argc=1, argv=0x7fffffffe528)

    at vectorAdd.cu:127

cuda-memcheck freezes if I try to run it.

I get similar errors with any other SDK example that actually uses the GPU (deviceQuery works, bandwidthTest does not). I’m not sure whether I configured something wrong on my machine or whether my card is messed up. Below is the output from deviceQuery:

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "Tesla C2070"

  CUDA Driver Version / Runtime Version          5.0 / 5.0

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5375 MBytes (5636292608 bytes)

  (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores

  GPU Clock rate:                                1147 MHz (1.15 GHz)

  Memory Clock rate:                             1494 Mhz

  Memory Bus Width:                              384-bit

  L2 Cache Size:                                 786432 bytes

  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)

  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

  Total number of registers available per block: 32768

  Warp size:                                     32

  Maximum number of threads per multiprocessor:  1536

  Maximum number of threads per block:           1024

  Maximum sizes of each dimension of a block:    1024 x 1024 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             512 bytes

  Concurrent copy and execution:                 Yes with 2 copy engine(s)

  Run time limit on kernels:                     No

  Integrated GPU sharing Host Memory:            No

  Support host page-locked memory mapping:       Yes

  Concurrent kernel execution:                   Yes

  Alignment requirement for Surfaces:            Yes

  Device has ECC support enabled:                Yes

  Device is using TCC driver mode:               No

  Device supports Unified Addressing (UVA):      Yes

  Device PCI Bus ID / PCI location ID:           1 / 0

  Compute Mode:

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device = Tesla C2070

I’m not sure what else to check; any pointers would be much appreciated.
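
One thing that may help narrow this down: kernel launches are asynchronous, so an error raised while the kernel runs is normally reported by the next synchronizing call. That would explain why the failure is attributed to the cudaMemcpy on line 127 rather than to the launch itself. Below is a minimal sketch, not the SDK code, that checks for errors right after the launch and again after an explicit synchronize. The CHECK macro and the vecAdd kernel are made up for illustration, and the size of 50000 elements is an assumption chosen to match the 196 blocks of 256 threads in the log. Given that cuda-gdb hangs, the synchronize call here may hang as well, which would at least confirm that the kernel is the part that never completes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the SDK): print the failing call and exit.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed at %s:%d: %s\n", #call,        \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: element-wise addition of two vectors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 50000;  // assumed size; gives 196 blocks of 256 threads
    const size_t bytes = n * sizeof(float);

    // Host buffers with known contents.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    if (!hA || !hB || !hC) { fprintf(stderr, "host malloc failed\n"); return EXIT_FAILURE; }
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device buffers and host-to-device copies, each checked individually.
    float *dA = NULL, *dB = NULL, *dC = NULL;
    CHECK(cudaMalloc((void **)&dA, bytes));
    CHECK(cudaMalloc((void **)&dB, bytes));
    CHECK(cudaMalloc((void **)&dC, bytes));
    CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());       // errors detected at launch time
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel executes

    // If we get here, the kernel finished; now the copy back is checked on its own.
    CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));
    printf("c[0] = %f\n", hC[0]);

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFree(dC));
    free(hA); free(hB); free(hC);
    return 0;
}
```

If the host-to-device copies succeed and the first failing line is the cudaDeviceSynchronize, the problem is in kernel execution rather than in the transfers, which would be consistent with bandwidthTest getting through its host/device transfers before failing.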

Try to recompile the SDK examples.

As an update, it looks like the OpenCL SDK examples are working, while the CUDA ones still fail:

$ ./C/bin/linux/release/bandwidthTest

[bandwidthTest] starting...

./C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla C2070

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2206.1

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2126.8

bandwidthTest.cu(914) : CUDA Runtime API error 4: unspecified launch failure.

$ ./OpenCL/bin/linux/release/oclBandwidthTest

[oclBandwidthTest] starting...

./OpenCL/bin/linux/release/oclBandwidthTest Starting...

Running on...

Tesla C2070

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2030.0

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2055.4

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     968102.5

[oclBandwidthTest] test results...

PASSED