I have previously been able to get the CUDA SDK examples working, but I have since moved my hard drive into a new machine. I believe I have configured everything correctly:
[list=1]
[*]Fedora 15
[*]Tesla C2070
[*]Tried both 4.2 and 5.0 drivers/toolkit
[*]Driver installs (I change to runlevel 3 and blacklist nouveau with the following commands):
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
[*]Running make works fine for the SDK
[*]When I run an example, I get:
$ C/bin/linux/release/vectorAdd
[vectorAdd] starting...
Vector Addition
vectorAdd.cu(127) : CUDA Runtime API error 4: unspecified launch failure.
[*]If I run it under cuda-gdb, the debugger freezes until I break out with Ctrl-C. The backtrace shows it is stuck in cudaMemcpy:
$ cuda-gdb ../../bin/linux/debug/vectorAdd
NVIDIA (R) CUDA Debugger
5.0 release
Portions Copyright (C) 2007-2012 NVIDIA Corporation
GNU gdb (GDB) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd...done.
(cuda-gdb) r
Starting program: /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd
[Thread debugging using libthread_db enabled]
[vectorAdd] starting...
Vector Addition
[New Thread 0x7ffff705c700 (LWP 409)]
[Context Create of context 0x622630 on Device 0]
[Launch of CUDA Kernel 0 (VecAdd<<<(196,1,1),(256,1,1)>>>) on Device 0]
^C
Program received signal SIGINT, Interrupt.
0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0
(cuda-gdb) bt
#0 0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00007ffff7252a17 in ?? () from /usr/lib64/libcuda.so
#2 0x00007ffff732b9c0 in ?? () from /usr/lib64/libcuda.so
#3 0x00007ffff732bf7a in ?? () from /usr/lib64/libcuda.so
#4 0x00007ffff732c1b2 in ?? () from /usr/lib64/libcuda.so
#5 0x00007ffff7318268 in ?? () from /usr/lib64/libcuda.so
#6 0x00007ffff7319169 in ?? () from /usr/lib64/libcuda.so
#7 0x00007ffff730f279 in ?? () from /usr/lib64/libcuda.so
#8 0x00007ffff7245787 in ?? () from /usr/lib64/libcuda.so
#9 0x00007ffff724977d in ?? () from /usr/lib64/libcuda.so
#10 0x00007ffff7231406 in ?? () from /usr/lib64/libcuda.so
#11 0x00007ffff7daaf5a in ?? () from /usr/local/cuda/lib64/libcudart.so.5.0
#12 0x00007ffff7dd798f in cudaMemcpy ()
from /usr/local/cuda/lib64/libcudart.so.5.0
#13 0x00000000004011dd in main (argc=1, argv=0x7fffffffe528)
at vectorAdd.cu:127
[*]cuda-memcheck also freezes if I try to run it
[/list]
I get similar errors with any other SDK example that actually uses the GPU (deviceQuery works, bandwidthTest does not). I’m not sure whether I misconfigured something on my machine or whether my card is messed up. Below is the output from deviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "Tesla C2070"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636292608 bytes)
(14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1494 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device = Tesla C2070
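In case it helps narrow things down, here is a minimal sketch of the sanity checks that go with the driver setup above: confirming that nouveau is really not loaded after the initramfs rebuild, and pulling any NVIDIA driver messages (lines containing NVRM or Xid) out of the kernel log. The helper names module_listed and nvidia_log_lines are hypothetical and not part of any NVIDIA tool; the sketch assumes lsmod and dmesg are available and skips the live checks when they are not.

```shell
#!/bin/sh
# Sketch only: module_listed and nvidia_log_lines are hypothetical helpers,
# not part of the CUDA toolkit or the NVIDIA driver.

# Succeeds if the lsmod-style listing on stdin contains the named module.
module_listed() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit(found ? 0 : 1) }'
}

# Prints the kernel-log lines on stdin that mention NVRM or Xid.
# Prints nothing (and still succeeds) when there are no such lines.
nvidia_log_lines() {
    grep -E 'NVRM|Xid' || true
}

# Report the live module state when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_listed nouveau; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
    if lsmod | module_listed nvidia; then
        echo "nvidia module is loaded"
    else
        echo "nvidia module is not loaded"
    fi
fi

# Show driver-related kernel messages when dmesg is available and readable.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | nvidia_log_lines
fi
```

The helpers read from stdin so they can also be pointed at a saved log, for example `nvidia_log_lines < /var/log/messages`, assuming that is where the kernel log ends up on the system.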
I’m not sure what else to check; any pointers would be much appreciated.