Hi guys,
I also have a strange bug with cuda-gdb on my GPU cluster, which is based on Tesla S1070 nodes.
We have CUDA 3.2 installed (I think we have had this problem since CUDA 3.0) with the Linux 64-bit driver 260.19.21.
The OS is "Red Hat Enterprise Linux Server release 5.3", slightly modified by the cluster vendor.
We can't break in CUDA kernels.
After reading this post, I tried a simple reproducer following all the advice given here, but it still fails.
I compile my code in one step:
nvcc -G -g matmul.cu -o matmul_debug
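For anyone who wants to try this on their own machine: the exact matmul.cu is not shown here, so the following is only a minimal sketch of a reproducer of this kind. The only parts taken from the session log are the kernel name and signature (kernel_mulmat, with real assumed to be float from the mangled name) and n = 512. The kernel body, the 16x16 launch configuration, the CHECK macro and the host code are all illustrative assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef float real;  // assumption: the mangled name (PfS_S_i) suggests float

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Naive C = A * B for n x n row-major matrices, one thread per element of C.
__global__ void kernel_mulmat(real *A, real *B, real *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        real sum = 0;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;  // a convenient line for a breakpoint
    }
}

int main() {
    const int n = 512;  // value seen in the log (__cuda_3=512)
    const size_t bytes = (size_t)n * n * sizeof(real);

    real *hA = (real *)malloc(bytes);
    real *hB = (real *)malloc(bytes);
    real *hC = (real *)malloc(bytes);
    if (!hA || !hB || !hC) return EXIT_FAILURE;
    for (int i = 0; i < n * n; ++i) { hA[i] = 1; hB[i] = 2; }

    real *dA, *dB, *dC;
    CHECK(cudaMalloc((void **)&dA, bytes));
    CHECK(cudaMalloc((void **)&dB, bytes));
    CHECK(cudaMalloc((void **)&dC, bytes));
    CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);  // assumed launch configuration
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    kernel_mulmat<<<grid, block>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());  // reports launch errors

    // The blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));
    printf("C[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Checking every runtime call this way at least rules out a plain launch or allocation failure when the program is run outside the debugger, before looking at cuda-gdb itself.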
At run time, I try to set the focus without success, and cuda-gdb finally crashes and dumps core when I try to step into the kernel:
Program exited normally.
(cuda-gdb) run
Starting program: ./matmul_debug
[Thread debugging using libthread_db enabled]
[New process 23054]
Matrice réelle NxN: 1.05 Mo
[New Thread 47728650801024 (LWP 23054)]
[Switching to Thread 47728650801024 (LWP 23054)]
Breakpoint 1, kernel_mulmat (__cuda_0=0x100000, __cuda_1=0x200000, __cuda_2=0x300000, __cuda_3=512) at matmul.cu:6
6 __global__ void kernel_mulmat(real *A, real *B,real *C, int n){
(cuda-gdb) info cuda device
Focus not set on any running CUDA kernel.
(cuda-gdb) cuda device 0
No CUDA kernel is currently running.
(cuda-gdb) cuda device 1
No CUDA kernel is currently running.
(cuda-gdb) info cuda kernels
No active kernel on CUDA devices.
(cuda-gdb) step
Breakpoint 1, kernel_mulmat (__cuda_0=0x100000, __cuda_1=0x200000, __cuda_2=0x300000, __cuda_3=512) at matmul.cu:6
6 __global__ void kernel_mulmat(real *A, real *B,real *C, int n){
(cuda-gdb) info cuda threads
Focus not set on any running CUDA kernel.
(cuda-gdb) step
Breakpoint 1, kernel_mulmat (__cuda_0=0x100000, __cuda_1=0x200000, __cuda_2=0x300000, __cuda_3=512) at matmul.cu:6
6 __global__ void kernel_mulmat(real *A, real *B,real *C, int n){
(cuda-gdb) step
__device_stub__Z13kernel_mulmatPfS_S_i (__par0=0x100000, __par1=0x200000, __par2=0x300000, __par3=512) at /tmp/tmpxft_00005985_00000000-1_matmul.cudafe1.stub.c:6
6 /tmp/tmpxft_00005985_00000000-1_matmul.cudafe1.stub.c: No such file or directory.
in /tmp/tmpxft_00005985_00000000-1_matmul.cudafe1.stub.c
(cuda-gdb) step
7 in /tmp/tmpxft_00005985_00000000-1_matmul.cudafe1.stub.c
(cuda-gdb) step
cudaLaunch<char> (
entry=0x401062 "UH\211�\203�H\211}�\211u�\211U�211M�213M�\213U�\213u�\213}�\031����\220UH\211�\203�020�017\021E�\213E�\211E�\017\020E�\t��f\017(�\017\020\005�s") at /applications/cuda-3.2/bin/../include/cuda_runtime.h:935
935 return cudaLaunch((const char*)entry);
(cuda-gdb) step
BACKTRACE (9 frames):
cuda-gdb[0x459b2e]
/lib64/libc.so.6[0x3d36e30280]
/usr/lib64/libcuda.so[0x2b4092e3c06f]
/usr/lib64/libcuda.so[0x2b4092e379aa]
/usr/lib64/libcuda.so[0x2b4092e389dd]
/usr/lib64/libcuda.so[0x2b4092e39db5]
/usr/lib64/libcuda.so[0x2b4092fb90c9]
/lib64/libpthread.so.0[0x3d37a06367]
/lib64/libc.so.6(clone+0x6d)[0x3d36ed2f7d]
If I put the breakpoint on a line inside the kernel, it also crashes.
The strange thing is that the same code, with the same CUDA version and the same driver, works on a machine with CentOS 5.2 and a Quadro FX 5800.
We are waiting for the CUDA 4 final release to see whether anything has changed…