Newbie struggling to get cuda-gdb to run the example in the CUDA-GDB user manual

I am trying to troubleshoot why I cannot get CUDA-GDB 3.2 to work properly. I’m getting bizarre errors whenever I try to run cuda-gdb. I read in the user manual that X11 cannot be running on the GPU that is used for debugging, because the debugger effectively makes the GPU look hung to the X server, resulting in a deadlock or crash.

I also read the note that “the CUDA driver automatically excludes the device used by X11 from being picked by the application being debugged. This can change the behavior of the application.”

I’m wondering if the problem is that the CUDA Driver Version is 4.0 while the CUDA Runtime Version is 3.2.
I’m also wondering if the driver is failing to select the proper device while debugging. Maybe the debugger is choosing the device running the X server?

I found the NVIDIA X Server Settings and confirmed that the Quadro FX 380 is the GPU for X Screen 0 and has the monitors listed in its Display Devices field.
The GeForce GTX 260 does not have any X screens or display devices listed.

I’ve also noticed that the debugger outputs “__cuda_0=0x0”. I was wondering whether this has anything to do with the known issue in Appendix B, which says that “debugging applications using textures is not supported on GPUs with sm_type less than sm_20”.

I’ve been able to compile and run the sample programs. I’ve run deviceQuery and bandwidthTest.

The output is below. I have also listed some of the output from trying to use the debugger as described in the CUDA-GDB user manual.
I will continue to research this problem, but any advice or a pointer in the right direction would be much appreciated.
Thank you!

Device 0: “GeForce GTX 260”
CUDA Driver Version: 4.0
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 939327488 bytes
Multiprocessors x Cores/MP = Cores: 27 (MP) x 8 (Cores/MP) = 216 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No

Device 1: “Quadro FX 380”
CUDA Driver Version: 4.0
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 267714560 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No

I’ve run the bandwidth test:
Running on…

Device 0: GeForce GTX 260
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4337.3

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3781.1

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 96744.1

[bandwidthTest] - Test results:
PASSED

[jjackson@l-lnx101 CUDA-GDB]$ nvcc -g -G bitreverse.cu -o bitreverse

[jjackson@l-lnx101 CUDA-GDB]$ cuda-gdb bitreverse

NVIDIA (R) CUDA Debugger

3.2 release

Portions Copyright (C) 2008-2010 NVIDIA Corporation

GNU gdb 6.6

Copyright (C) 2006 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are

welcome to change it and/or distribute copies of it under certain conditions.

Type “show copying” to see the conditions.

There is absolutely no warranty for GDB. Type “show warranty” for details.

This GDB was configured as “x86_64-unknown-linux-gnu”…

Using host libthread_db library “/lib64/libthread_db.so.1”.

(cuda-gdb)

(cuda-gdb) b main

Breakpoint 1 at 0x400c30: file bitreverse.cu, line 28.

(cuda-gdb) b bitreverse

Breakpoint 2 at 0x401ec7: file bitreverse.cu, line 28.

(cuda-gdb) b 21

Breakpoint 3 at 0x400c24: file bitreverse.cu, line 21.

(cuda-gdb) r

Starting program: /mnt/nfs/netapp2/grad/jjackson/cuda-examples-3.2/CUDA-GDB/bitreverse

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r3.2/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r3.2/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

[Thread debugging using libthread_db enabled]

[New process 4922]

[New Thread 139669044221760 (LWP 4922)]

[Switching to Thread 139669044221760 (LWP 4922)]

Breakpoint 3, main () at bitreverse.cu:24

24 int main(void) {

(cuda-gdb) c

Continuing.

Breakpoint 3, main () at bitreverse.cu:24

24 int main(void) {

(cuda-gdb) c

Continuing.

Breakpoint 3, main () at bitreverse.cu:24

24 int main(void) {

(cuda-gdb) c

Continuing.

Breakpoint 1, main () at bitreverse.cu:28

28 for (i = 0; i < N; i++)

(cuda-gdb) c

Continuing.

Breakpoint 1, main () at bitreverse.cu:28

28 for (i = 0; i < N; i++)

(cuda-gdb) c

Continuing.

Breakpoint 2, 0x0000000000401ec7 in bitreverse (__cuda_0=0x0)

at bitreverse.cu:28

28 for (i = 0; i < N; i++)

Breakpoint 2, 0x0000000000401ec7 in bitreverse (__cuda_0=0x0)

at bitreverse.cu:28

28 for (i = 0; i < N; i++)

(cuda-gdb) c

Continuing.

0 → 0

1 → 128

2 → 64

3 → 192

4 → 32

5 → 160

6 → 96

… (edited to shorten)
253 → 191

254 → 127

255 → 255

Program exited normally.

Since the assertion is in elf64-x86-64.c, relocation type 37 is most likely R_X86_64_IRELATIVE. It looks like that was introduced fairly recently in /usr/include/elf.h on most Linux distros. So this looks like a case of running cuda-gdb 3.2 on a newer distro, one that was unsupported at the time of 3.2. So, question 1: what distribution are you running?

The second issue could be caused by the first issue, though.

Upgrading to the latest cuda-gdb will probably fix these items.