I have previously been able to get the CUDA SDK examples working, but I have since moved my hard drive into a new machine. I believe I have configured both correctly:
Tried both 4.2 and 5.0 drivers/toolkit
Driver installs (I change to runlevel 3 and blacklist nouveau with the following):
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
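After rebooting, one way to confirm that the blacklist took effect is to check which kernel modules are loaded. The sketch below is only an illustration: `module_loaded` is a hypothetical helper name, and it assumes the proprietary driver registers itself under the module name `nvidia`.

```shell
#!/bin/sh
# Sketch: report whether the nouveau and nvidia kernel modules are loaded.
# module_loaded is a hypothetical helper; its optional second argument lets
# another file stand in for /proc/modules.
module_loaded() {
    # Each line of /proc/modules starts with the module name and a space.
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

for mod in nouveau nvidia; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```

If the blacklist worked, nouveau should be reported as not loaded and nvidia as loaded.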
Running make works fine for the SDK
When I run vectorAdd, I get:
$ C/bin/linux/release/vectorAdd
[vectorAdd] starting...
Vector Addition
vectorAdd.cu(127) : CUDA Runtime API error 4: unspecified launch failure.
If I run it under cuda-gdb, the debugger freezes until I break out with C-c. The backtrace shows it is stuck in cudaMemcpy:
$ cuda-gdb ../../bin/linux/debug/vectorAdd
NVIDIA (R) CUDA Debugger 5.0 release
Portions Copyright (C) 2007-2012 NVIDIA Corporation
GNU gdb (GDB) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd...done.
(cuda-gdb) r
Starting program: /home/moeng/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/vectorAdd
[Thread debugging using libthread_db enabled]
[vectorAdd] starting...
Vector Addition
[New Thread 0x7ffff705c700 (LWP 409)]
[Context Create of context 0x622630 on Device 0]
[Launch of CUDA Kernel 0 (VecAdd<<<(196,1,1),(256,1,1)>>>) on Device 0]
^C
Program received signal SIGINT, Interrupt.
0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0
(cuda-gdb) bt
#0 0x000000356a2099e1 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00007ffff7252a17 in ?? () from /usr/lib64/libcuda.so
#2 0x00007ffff732b9c0 in ?? () from /usr/lib64/libcuda.so
#3 0x00007ffff732bf7a in ?? () from /usr/lib64/libcuda.so
#4 0x00007ffff732c1b2 in ?? () from /usr/lib64/libcuda.so
#5 0x00007ffff7318268 in ?? () from /usr/lib64/libcuda.so
#6 0x00007ffff7319169 in ?? () from /usr/lib64/libcuda.so
#7 0x00007ffff730f279 in ?? () from /usr/lib64/libcuda.so
#8 0x00007ffff7245787 in ?? () from /usr/lib64/libcuda.so
#9 0x00007ffff724977d in ?? () from /usr/lib64/libcuda.so
#10 0x00007ffff7231406 in ?? () from /usr/lib64/libcuda.so
#11 0x00007ffff7daaf5a in ?? () from /usr/local/cuda/lib64/libcudart.so.5.0
#12 0x00007ffff7dd798f in cudaMemcpy () from /usr/local/cuda/lib64/libcudart.so.5.0
#13 0x00000000004011dd in main (argc=1, argv=0x7fffffffe528) at vectorAdd.cu:127
cuda-memcheck also freezes if I try to run it.
I get similar errors with any other SDK example that actually uses the GPU (deviceQuery works, bandwidthTest does not). I’m not sure whether I configured something wrong with my machine or whether my card is messed up. Below is the output from deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "Tesla C2070"
  CUDA Driver Version / Runtime Version  5.0 / 5.0
  CUDA Capability Major/Minor version number: 2.0
  Total amount of global memory: 5375 MBytes (5636292608 bytes)
  (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
  GPU Clock rate: 1147 MHz (1.15 GHz)
  Memory Clock rate: 1494 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 786432 bytes
  Max Texture Dimension Size (x,y,z)  1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers  1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and execution: Yes with 2 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Concurrent kernel execution: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support enabled: Yes
  Device is using TCC driver mode: No
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device = Tesla C2070
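Since deviceQuery succeeds while anything that launches a kernel fails, one further check would be to see whether the NVIDIA kernel module logged anything around the time of the failure. The driver prefixes its kernel log messages with `NVRM`, and it reports GPU faults as `Xid` events. The sketch below just filters the kernel log for those strings; `nvrm_lines` is a hypothetical helper name, and whether anything is actually logged for this failure is an assumption.

```shell
#!/bin/sh
# Sketch: print kernel log lines written by the NVIDIA kernel module.
# nvrm_lines is a hypothetical helper that filters log text from stdin.
nvrm_lines() {
    # grep exits non-zero when nothing matches; treat that as "no messages".
    grep -E 'NVRM|Xid' || true
}

# dmesg may need root on some systems; its errors are suppressed here.
dmesg 2>/dev/null | nvrm_lines
```

Running this right after a failed vectorAdd run keeps the relevant messages near the end of the output.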
I’m not sure what else to check. Any pointers would be much appreciated.