Hi, I’m fairly new to CUDA. I’ve been working on some code recently and I’m having serious problems with cuda-gdb and with my program in general.
I’m running this on:
Device 0:
Name: Tesla C1060
Compute capability: 1.3
Linux 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 16:39:28 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
NVRM version: NVIDIA UNIX x86_64 Kernel Module 260.19.26 Mon Nov 29 00:53:44 PST 2010
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
NVIDIA (R) CUDA Debugger
3.2 release
Portions Copyright (C) 2008-2010 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
* The program runs to completion when executed normally, but under cuda-gdb it hangs in cudaMalloc() and never finishes.
If I interrupt execution in cuda-gdb while stepping over cudaMalloc(), I can run a backtrace and get:
#0 0x0000003e366cd1c3 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002b24d0f07fd8 in ?? () from /usr/lib64/libcuda.so
#2 0x00002b24d0f059fe in ?? () from /usr/lib64/libcuda.so
#3 0x00002b24d0f05cb9 in ?? () from /usr/lib64/libcuda.so
#4 0x00002b24d0ea3faf in ?? () from /usr/lib64/libcuda.so
#5 0x00002b24d0e9a3ed in ?? () from /usr/lib64/libcuda.so
#6 0x00002b24d0f6efa4 in ?? () from /usr/lib64/libcuda.so
#7 0x00002b24d0bc066d in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#8 0x00002b24d0bb7b1a in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#9 0x00002b24d0bb1379 in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#10 0x00002b24d0be683b in cudaMalloc () from /usr/local/cuda/lib64/libcudart.so.3
#11 0x0000000000405c89 in Lattice::initialiseCuda (this=0x7fff35849b10) at lattice.cu:97
Why is cudaMalloc() stuck in __select_nocancel() from /lib64/libc.so.6?
Because of this hang I can’t debug my kernel, which I believe has problems too.
If I comment out the kernel completely, cudaMalloc() does not hang, but then there is no kernel left to debug.
This cudaMalloc() is the first CUDA runtime call in the program, which I believe is where the runtime does its initialisation, so maybe something is going wrong there?
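To test that idea, here is a minimal sketch (not my actual code) that forces the runtime to initialise before the first real allocation, using the common `cudaFree(0)` idiom, and checks every return code. The helper names `report` and `initContext`, the device index 0 and the buffer size are all made up for illustration. If the hang moves from cudaMalloc() to the `cudaFree(0)` line under cuda-gdb, that would suggest the problem is in context creation rather than in the allocation itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a message if a runtime call failed,
// and hand the error code back so the caller can decide what to do.
static cudaError_t report(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err;
}

// Hypothetical helper: select a device and force context creation now,
// instead of letting it happen implicitly inside the first cudaMalloc().
static cudaError_t initContext(int device)
{
    cudaError_t err = report(cudaSetDevice(device), "cudaSetDevice");
    if (err != cudaSuccess)
        return err;
    // cudaFree(0) frees nothing, but it makes the runtime create its context.
    return report(cudaFree(0), "cudaFree(0)");
}

int main()
{
    // Device 0 is an assumption; it matches the Tesla C1060 listed above.
    if (initContext(0) != cudaSuccess)
        return 1;

    // Arbitrary small allocation, standing in for the real one in lattice.cu.
    float *d_buf = NULL;
    if (report(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)), "cudaMalloc") != cudaSuccess)
        return 1;

    report(cudaFree(d_buf), "cudaFree");
    printf("context created and allocation succeeded\n");
    return 0;
}
```

Built with something like `nvcc -g -G` and run under cuda-gdb, setting a breakpoint on `initContext` should show whether initialisation alone is enough to trigger the hang.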
* The application sometimes dumps the following to standard error or standard output (I’m not sure which):
#.word 18, 130, 0x82aa8
#.word 18, 132, 0x82b58
#.word 18, 134, 0x82ba8
#.word 18, 136, 0x82bd0
#.word 18, 138, 0x82c20
#.word 18, 143, 0x82c48
It continues like that for quite a while. Does anyone know what this output means?
My code is on GitHub (delcypher/lc) if people want to see what I’m working with.
Thanks.