cuda-gdb hang and compiled program spewing nonsense

Hi, I’m fairly new to CUDA. I’ve been working on some code recently and I’m having serious problems with cuda-gdb and with my program in general.

I’m running this on:

Device 0:

Name: Tesla C1060

Compute capability: 1.3

Linux 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 16:39:28 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

NVRM version: NVIDIA UNIX x86_64 Kernel Module  260.19.26  Mon Nov 29 00:53:44 PST 2010

GCC version:  gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

NVIDIA (R) CUDA Debugger

3.2 release

Portions Copyright (C) 2008-2010 NVIDIA Corporation

GNU gdb 6.6

Copyright (C) 2006 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are

welcome to change it and/or distribute copies of it under certain conditions.

Type "show copying" to see the conditions.

There is absolutely no warranty for GDB.  Type "show warranty" for details.

This GDB was configured as "x86_64-unknown-linux-gnu".

* The program runs to completion under normal execution, but under cuda-gdb it hangs in cudaMalloc() and never finishes.

If I pause execution in cuda-gdb while stepping over cudaMalloc(), a backtrace shows:

#0  0x0000003e366cd1c3 in __select_nocancel () from /lib64/libc.so.6

#1  0x00002b24d0f07fd8 in ?? () from /usr/lib64/libcuda.so

#2  0x00002b24d0f059fe in ?? () from /usr/lib64/libcuda.so

#3  0x00002b24d0f05cb9 in ?? () from /usr/lib64/libcuda.so

#4  0x00002b24d0ea3faf in ?? () from /usr/lib64/libcuda.so

#5  0x00002b24d0e9a3ed in ?? () from /usr/lib64/libcuda.so

#6  0x00002b24d0f6efa4 in ?? () from /usr/lib64/libcuda.so

#7  0x00002b24d0bc066d in ?? () from /usr/local/cuda/lib64/libcudart.so.3

#8  0x00002b24d0bb7b1a in ?? () from /usr/local/cuda/lib64/libcudart.so.3

#9  0x00002b24d0bb1379 in ?? () from /usr/local/cuda/lib64/libcudart.so.3

#10 0x00002b24d0be683b in cudaMalloc () from /usr/local/cuda/lib64/libcudart.so.3

#11 0x0000000000405c89 in Lattice::initialiseCuda (this=0x7fff35849b10) at lattice.cu:97

Why is cudaMalloc() stuck in __select_nocancel() from /lib64/libc.so.6?

Because of this hang I can’t debug my kernel which I believe has problems too.

If I comment out the kernel completely, cudaMalloc() does not hang, but then there is no kernel left to debug.

This is the first CUDA runtime call in the program, which I believe triggers initialisation, so maybe something is going wrong there?

* The application sometimes dumps the following to standard error or standard output (I’m not sure which)…

#.word 18, 130, 0x82aa8

#.word 18, 132, 0x82b58

#.word 18, 134, 0x82ba8

#.word 18, 136, 0x82bd0

#.word 18, 138, 0x82c20

#.word 18, 143, 0x82c48

and it continues like that for quite a while… Does anyone know what that output means?

My code is on GitHub http://github.com/delcypher/lc if people want to see what I’m working with.

Thanks.

cuda-gdb hangs like that normally happen if you try debugging on a display GPU, which isn’t supported. If you have multiple devices, you will need to make sure the process selects a non-display card to run on.

Yeah, make sure X is not running if it’s a single-GPU system, i.e. use console mode, or use VNC or NX Client to debug the target system remotely.

Thanks for the replies. In this particular instance the server I was using was running X, and I was using the first card, which is probably the one X is using. Is there a way to check? nvidia-smi said a display was plugged into it, but that doesn’t necessarily mean X is using it.

The machine has 4 cards in it. Presumably I’ll be able to use cuda-gdb with my program on one of the other cards whilst X is running, right?

I think there is a bigger issue here than me picking the wrong card, however, as I have previously forced the program to pick a different card and it still hung in cuda-gdb. I need to check where in the stack this is happening.

Unfortunately the machine has been shut down over the weekend, so I’ll report back in a few days.

The easiest thing to do on a machine running X11 is to use nvidia-smi to put the display card(s) into compute-prohibited mode. That will guarantee that cuda-gdb can’t try to establish a context on a card that won’t work.

Couldn’t I just put all the cards apart from the one X is using into compute-exclusive mode instead?

Unfortunately I can’t use nvidia-smi to set the compute mode, as I do not have root access on the machine I’m running on. I have 4 cards, and nvidia-smi -q -a tells me the following (with the unnecessary parts stripped out):

Driver Version                  : 260.19.26

GPU 0:

        Product Name            : Quadro FX 3800

        PCI Device/Vendor ID    : 5ff10de

        PCI Location ID         : 0:1:0

        Display                 : Connected

        Utilization

            GPU                 : 0%

            Memory              : 3%

GPU 1:

        Product Name            : Tesla C1060

        PCI Device/Vendor ID    : 5e710de

        PCI Location ID         : 0:2:0

        Display                 : Not connected

        Utilization

            GPU                 : 0%

            Memory              : 0%

GPU 2:

        Product Name            : Tesla C1060

        PCI Device/Vendor ID    : 5e710de

        PCI Location ID         : 0:83:0

        Display                 : Not connected

        Utilization

            GPU                 : 0%

            Memory              : 0%

GPU 3:

        Product Name            : Tesla C1060

        PCI Device/Vendor ID    : 5e710de

        PCI Location ID         : 0:84:0

        Display                 : Not connected

        Utilization

            GPU                 : 0%

            Memory              : 0%

However, I have tried each of the following calls before cudaMalloc(), and cudaMalloc() would still hang:

cudaSetDevice(1); // cudaMalloc() still hung later!

cudaSetDevice(2); // cudaMalloc() still hung later!

cudaSetDevice(3); // cudaMalloc() still hung later!
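As an aside, none of those cudaSetDevice() calls has its return value checked, so a failure in device selection itself would go unnoticed. A minimal error-checking sketch (the CUDA_CHECK macro and the buffer size are my own invention, not from the attached code; device 1 is one of the Teslas according to the nvidia-smi listing above):

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical helper: abort with a readable message if a CUDA
 * runtime call fails, instead of silently ignoring the error. */
#define CUDA_CHECK(call)                                        \
    do {                                                        \
        cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,           \
                    cudaGetErrorString(err));                   \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)

int main(void)
{
    int *d_buf = NULL;
    CUDA_CHECK(cudaSetDevice(1));  /* a non-display Tesla C1060 */
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 256 * sizeof(int)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

Compile with nvcc as usual; if cudaSetDevice() or the context creation inside cudaMalloc() fails outright (rather than hanging), this at least prints which call failed and why.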

I’ve produced self-contained code that reproduces the problems I’ve been having with cuda-gdb. Would anyone be willing to try it on their system? mod-test.cu is attached to this post.

To compile, run:

nvcc -G -g mod-test.cu -o test

Now run cuda-gdb:

cuda-gdb test

(gdb) break 68

(gdb) run 0 6 3

(gdb) n

###IT WILL HANG HERE! Press CTRL+C to pause execution

Program received signal SIGINT, Interrupt.

0x0000003e366cd1c3 in __select_nocancel () from /lib64/libc.so.6

Thanks in advance.
mod-test.cu (2.89 KB)

I also just noticed the following in Appendix B (Known Issues) of the cuda-gdb manual:

So it shouldn’t matter if one of the cards is being used by X; cuda-gdb should ignore it (and even when I don’t call cudaSetDevice() at all, cudaMalloc() still hangs later!).

I’ve been in contact with someone from Nvidia and we eventually tracked down the problem.

The system I use is a multi-user system, and several of the users are working on CUDA projects right now.

Although I am aware it isn’t possible for multiple users to run cuda-gdb at the same time, that wasn’t the case here: I checked that no one else was running cuda-gdb with

ps -AF | grep cuda

The hang in cudaMalloc() whilst trying to use cuda-gdb was caused by an old cuda-gdb session (run by another user) that had crashed and left a pipe behind in /tmp/.

Running the following command shows this:

ls -l /tmp/cuda*

prw-r----- 1 dl7749 stapdev    0 Feb 15 10:18 /tmp/cuda-gdb.pipe.2.3

-rw------- 1 aa7086 stapdev 2680 Feb 10 16:43 /tmp/cudagdb.rtobj.jrmNEx

-rw------- 1 dl7749 stapdev 2680 Feb 15 10:18 /tmp/cudagdb.rtobj.R0Nyw3

Deleting the old pipe (/tmp/cuda-gdb.pipe.2.3) fixed the problem (you might as well delete the temporary object files too).

The lesson here is: if cuda-gdb crashes, clean up its leftover files in /tmp/ so you can run cuda-gdb again.
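For anyone who hits this in future, a small sketch of the cleanup (the file patterns are taken from the listing above, and may differ on other cuda-gdb versions; note that you can only delete files your own account owns, so another user may need to remove theirs):

```shell
# Sketch: remove stale cuda-gdb pipes and temporary object files that a
# crashed session left in /tmp. Adjust the patterns if your cuda-gdb
# version names its files differently.
clean_cuda_tmp() {
    for f in /tmp/cuda-gdb.pipe.* /tmp/cudagdb.rtobj.*; do
        if [ -e "$f" ]; then
            rm -f "$f"
            echo "removed $f"
        fi
    done
}

clean_cuda_tmp
```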

It would be nice if cuda-gdb handled this situation more gracefully instead of giving the impression that the program being debugged is making cudaMalloc() hang, but at least there is a solution to the problem now.