cuda-gdb hangs

Hello,

I am using Fedora 17 and CUDA 5.0 with a GeForce 690. Every time I run cuda-gdb it hangs for about a minute at the first CUDA runtime function (e.g., cudaMalloc(), cudaMemcpy(), etc.). For example:
[New Thread 0x7ffff6fa8700 (LWP 42361)]
[Context Create of context 0x9c5430 on Device 0]
[Launch of CUDA Kernel 0 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 5 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 6 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 7 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 8 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 9 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 10 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 11 (memset32_aligned1D<<<(1,1,1),(128,1,1)>>>) on Device 0]
(hangs here for 1 minute)

Has anyone seen this before? It is certainly annoying. I have noticed it has something to do with the size of the CUDA code, because if I comment out a large portion of it, the hang seems to disappear (note: the hang happens before any of that code is even hit). Also, the hang happens only once per GPU, so I’m guessing it has something to do with cuda-gdb initializing the GPU for debugging.
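For context, the first CUDA call my code makes is nothing exotic; it is essentially equivalent to the sketch below (not my actual code), and the hang hits right at that first runtime call:

// repro.cu -- stripped-down sketch, not my real code
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int *d_buf = NULL;
    // cuda-gdb stalls here in my full application; a toy program like this
    // alone probably wouldn't show it, since the hang seems tied to how much
    // CUDA code is linked in
    cudaError_t err = cudaMalloc((void **)&d_buf, 1024 * sizeof(int));
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    cudaFree(d_buf);
    return 0;
}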

UPDATE: If I run cuda-gdb in “non-stop” mode the hang disappears. I.e., if I do

     # Enable the async interface.
     set target-async 1
     
     # If using the CLI, pagination breaks non-stop.
     set pagination off
     
     # Finally, turn it on!
     set non-stop on

when cuda-gdb starts up.
From http://www.sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html#Non_002dStop-Mode
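To save some typing, I think the same settings can be passed on the command line (my assumption: cuda-gdb accepts gdb's -ex switch, since it is built on top of gdb; ./myapp is just a placeholder for the real binary):

cuda-gdb -ex "set target-async 1" -ex "set pagination off" -ex "set non-stop on" ./myapp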

I guess this allows the various CUDA runtime threads to do their thing in the background while I step through code. Should I be doing this?

UPDATE #2:
The workaround in the previous post does not let me debug CUDA kernels (cuda-gdb just skips right over them). So I guess I have to go back to the normal gdb mode and just deal with the delay.

Bump. Anyone else experience this hang?

Hi rmccabe3701, I am on the cuda-gdb team. Could I get some more information to look into this issue?

  1. Does this happen for all applications or just a particular one?
  2. If this only happens for a particular application, could you post its source and the options you used to build it?
  3. Does this delay happen only on the first run of the application in a debug session? Or does it persist for every run of the application within the same cuda-gdb session? Or does this only happen the first time the application is run on a freshly booted system and subsequently disappear for fresh cuda-gdb invocations?
  4. Could you include the output of cat /proc/driver/nvidia/version?

I can tell you I’m getting the exact same problem. I’m working with a large application, and any simple call like a cudaMalloc() starts this up. Except mine goes on for a bit longer than a minute. I’ll get to kernel 125, then it hangs, then it does a few more, hangs, gets to about 367, then moves on. Output looks like:

[Launch of CUDA Kernel 0 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]

[Launch of CUDA Kernel 367 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]

From here, debugging is slow. And it skips over kernels, not really processing them.

For simple applications, this problem does not happen. Debugger works great. I’ve tried all sorts of permutations trying to get this error to pop up in simpler applications, and no luck.

Unfortunately, it’s going to be tough stripping the complex application down into a simpler one to try and tell when and why this happens.

Here is my cat /proc/driver/nvidia/version output:

NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
GCC version: gcc version 4.6.3 (Debian 4.6.3-14)

Huh…looks like I need to update my driver. Can’t believe I forgot to do this. :)

I’ll work on that and report on the result.

Edit:

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 310.40 Sun Mar 3 22:11:07 PST 2013
GCC version: gcc version 4.6.3 (Debian 4.6.3-14)

The same problem is still happening.

Edit #2:
I stripped the program down to just a plain main(). I left my g++ compile lines the same as before (lots of linking going on). Same bug.
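To be explicit, the stripped-down test was essentially just this, compiled and linked exactly as before:

// stripped.cpp -- the entire test program
int main()
{
    return 0;   // no CUDA calls at all; the hang still showed up with my original link lines
}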

After some careful narrowing down, I found out that one library in particular seems to be causing the mess. When I stopped linking against that one library, the problem went away.


Could you clarify which library this was? If possible, can you post the results of running ldd and readelf -d on the library?

Hi again, I am still having the hang issue, and as Brad says, the hang seems to take much longer than 1 minute (>5 minutes).
vyas:

  1. Does this happen for all applications or just a particular one?
    All applications that link in my libgpumac.a library (this has all the CUDA code in it).

  2. If this only happens for a particular application, could you post its source and the options you used to build it?
    Can’t post the source (classified project). I can, however, post the build steps:

g++ -m64 -g -fno-inline-functions -O0 -DDEBUGGING -fPIC -DPIC -std=c++0x -D_GLIBCXX_USE_NANOSLEEP -I/usr/local/cuda/include -I./include -o src/gpuMgr.o -c src/gpuMgr.cpp

g++ -m64 -g -fno-inline-functions -O0 -DDEBUGGING -fPIC -DPIC -std=c++0x -D_GLIBCXX_USE_NANOSLEEP -I/usr/local/cuda/include -I./include -o src/subframe.o -c src/subframe.cpp

/usr/local/cuda/bin/nvcc -m64 -g -G -O0 -Xcompiler -gdwarf-2 -Xcompiler -g3 -Xcompiler -fno-inline-functions -Xcompiler -O0 -Xcompiler -DDEBUGGING -Xcompiler -fPIC -Xcompiler -DPIC -ccbin /usr/bin/gcc34 -arch=compute_30 -code=sm_30 -I/usr/local/cuda/include -I./include -o src/cudaWrappers.o -c src/cudaWrappers.cu

/usr/local/cuda/bin/nvcc -m64 -g -G -O0 -Xcompiler -gdwarf-2 -Xcompiler -g3 -Xcompiler -fno-inline-functions -Xcompiler -O0 -Xcompiler -DDEBUGGING -Xcompiler -fPIC -Xcompiler -DPIC -ccbin /usr/bin/gcc34 -arch=compute_30 -code=sm_30 -I/usr/local/cuda/include -I./include -o src/processSamples.o -c src/processSamples.cu

ar cru libgpumac.a src/gpuMgr.o src/subframe.o src/cudaWrappers.o src/processSamples.o

g++ -m64 -g -fno-inline-functions -O0 -DDEBUGGING -fPIC -DPIC -std=c++0x -D_GLIBCXX_USE_NANOSLEEP -I/usr/local/cuda/include -I./include -L/usr/local/cuda/lib64 -lcudart -lpthread -L/usr/local/cuda/lib64/ src/test/test_simConfig.cpp libgpumac.a -lACE -lpthread -o src/test/test_simConfig.bin

g++ -m64 -g -fno-inline-functions -O0 -DDEBUGGING -fPIC -DPIC -std=c++0x -D_GLIBCXX_USE_NANOSLEEP -I/usr/local/cuda/include -I./include -L/usr/local/cuda/lib64 -lcudart -lpthread -L/usr/local/cuda/lib64/ src/test/gpuMgrUnitTester.cpp libgpumac.a -lACE -lpthread -o src/test/gpuMgrUnitTester.bin

The last two commands build some of my unit test programs (both hang when running them through cuda-gdb). You can see that they link in the libgpumac.a library built above. NOTE: part of the problem may be that nvcc is not compatible with gcc 4.7 (the default compiler on Fedora 17). To get it to work I needed to get gcc 3.4 and instruct nvcc to build with this compiler via “-ccbin /usr/bin/gcc34”. If I try to build everything with gcc 4.7 I get the error:
/usr/local/cuda/include/host_config.h:82:2: error: #error – unsupported GNU version! gcc 4.7 and up are not supported!

  3. Does this delay happen only on the first run of the application in a debug session? Or does it persist for every run of the application within the same cuda-gdb session? Or does this only happen the first time the application is run on a freshly booted system and subsequently disappear for fresh cuda-gdb invocations?
    The hang only happens once per cuda-gdb session – always when execution hits the first CUDA-specific call (cudaMalloc, for example). To be clear: the hang only happens when I run with cuda-gdb; the application itself is well-behaved.

  4. Could you include the output of cat /proc/driver/nvidia/version?
    cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module 310.32 Mon Jan 14 14:41:13 PST 2013
    GCC version: gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC)

Here is one of my test program’s ldd and readelf output:

readelf -d src/test/gpuMgrUnitTester.bin

Dynamic section at offset 0x497990 contains 31 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.5.0]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libACE.so.6.1.7]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x488ce0
0x000000000000000d (FINI) 0x4ca80c
0x0000000000000019 (INIT_ARRAY) 0xa97850
0x000000000000001b (INIT_ARRAYSZ) 48 (bytes)
0x000000000000001a (FINI_ARRAY) 0xa97880
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x400260
0x0000000000000005 (STRTAB) 0x416070
0x0000000000000006 (SYMTAB) 0x405510
0x000000000000000a (STRSZ) 456398 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0xa981f8
0x0000000000000002 (PLTRELSZ) 2736 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x488230
0x0000000000000007 (RELA) 0x486e68
0x0000000000000008 (RELASZ) 5064 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x486d88
0x000000006fffffff (VERNEEDNUM) 4
0x000000006ffffff0 (VERSYM) 0x48573e
0x0000000000000000 (NULL) 0x0

ldd src/test/gpuMgrUnitTester.bin
linux-vdso.so.1 => (0x00007fff88dff000)
libcudart.so.5.0 => /usr/local/cuda-5.0/lib64/libcudart.so.5.0 (0x00007fb65ba9a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cfe600000)
libACE.so.6.1.7 => /lib64/libACE.so.6.1.7 (0x0000003c53000000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000003d08600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003cff600000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003d02200000)
libc.so.6 => /lib64/libc.so.6 (0x0000003cfe200000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003cfea00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003cfee00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003cfde00000)

I would really appreciate any help in resolving this issue because waiting for cuda-gdb is a huge time-waster!

UPDATE: I randomly tried installing the newest NVIDIA driver for GeForce 690 (319.17) and the hang seems to be gone :) I’ll keep you posted if I run into more issues.

There was an issue in cuda-gdb where some code did not scale correctly. As a result, there was a slowdown proportional to the number of kernels and device functions when loading modules in an application. As you have discovered, this has been fixed :). The fix will also be present in the official driver accompanying the CUDA 5.5 toolkit.

Bump.

I’ve just come across a similar issue. If I set a pending breakpoint then the debugger crashes on the first CUDA call:

[Switching to Thread 0x7fffa3e7f700 (LWP 3515)]
The CUDA driver has hit an internal error.
Error code: 0x40ec800000026
Further execution or debugging is unreliable.
Please ensure that your temporary directory is mounted with write and exec permissions.

(cuda-gdb) bt
#0  0x00007fffa29d3d90 in cudbgReportDriverInternalError () from /usr/lib/libcuda.so
#1  0x00007fffa29d77fe in ?? () from /usr/lib/libcuda.so
#2  0x00007fffa2943625 in ?? () from /usr/lib/libcuda.so
#3  0x00007fffa28752da in ?? () from /usr/lib/libcuda.so
#4  0x00007fffa2861933 in cuInit () from /usr/lib/libcuda.so
#5  0x00007fffe7ddc965 in ?? () from /usr/local/cuda/lib64/libcudart.so.6.0
#6  0x00007fffe7ddca0a in ?? () from /usr/local/cuda/lib64/libcudart.so.6.0
#7  0x00007fffe7ddca3b in ?? () from /usr/local/cuda/lib64/libcudart.so.6.0
#8  0x00007fffe7df7647 in cudaSetDevice () from /usr/local/cuda/lib64/libcudart.so.6.0
...

If no breakpoints are set then it takes several minutes to pass through this call. I’m not experiencing any problems when not in cuda-gdb.

I’m using Ubuntu 12.04 + Tesla C2075.

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  331.75  Wed Apr 30 11:25:31 PDT 2014
GCC version:  gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

This is kind of strange, as I believe I had none of these issues yesterday. Reverting the codebase to yesterday’s state didn’t help, so I can’t say precisely what triggered this cuda-gdb behaviour.

I’ve also reinstalled both the driver and the toolkit, but this didn’t help either.
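Next I will look into the hint in the error message about the temporary directory. Something like this is what I have in mind (my assumption: cuda-gdb uses /tmp by default and honours $TMPDIR, which I have not verified):

$ mount | grep -w /tmp        # check whether /tmp is mounted noexec
$ ls -ld /tmp                 # check that it is writable
$ mkdir -p $HOME/cuda-tmp     # alternative temp dir with write + exec permissions
$ export TMPDIR=$HOME/cuda-tmp
$ cuda-gdb ./myapp            # "./myapp" stands in for my real binary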