cuda-gdb won't print local variables in kernel function

I am trying to debug my code using cuda-gdb v3.0. When I try to print the value of a variable located within a kernel it says

No symbol “var_name” in current context.

All my code is located within one .cu file and I am compiling with the following line:

$ nvcc -G -g test.cu -o test

The program compiles and runs fine, producing the correct output. Also I am definitely stepping past the declaration and assignment of the variable (though not stepping out of the kernel function) before trying to print its value.

Your help is much appreciated,
David

Can you try using the latest (4.0) toolkit?

Just for the sake of completeness, I solved the original problem I posted about by including the flag -arch=sm_20 when compiling with nvcc, although I found no documentation indicating this was necessary for my graphics card, a GeForce GTX 480. I then encountered two new problems:

When asking the debugger to return system info it listed my card as a gf100. As stated above this is not the model of my card. Not sure if this was causing any problems but it definitely didn’t make me feel too secure.

More importantly, I couldn’t switch to all of the threads that I had launched. Weirdly I was limited to a small subset of the threads with the lowest indices, although the debugger correctly reported the block and grid dimensions (which were both smaller than the maximum allowed sizes).

I'm upgrading to the version 4.0 toolkit now; hopefully these problems go away.
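
For the record, the compile line that fixed the original symbol problem was just the one above with the architecture flag added:

$ nvcc -G -g -arch=sm_20 test.cu -o test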

A GTX480 is a gf100, so why was that unexpected?

Maybe because these threads weren’t running on the GPU when you tried to switch to them? If you launch 50,000 threads, they won’t all be running on the GPU at the same time.

Hmm ok, I guess both of those things make sense.

I expected to see something more like gf480 when asking for the system information. I'm not really familiar with NVIDIA's naming conventions; I just kinda assumed the worst, I guess. Thanks for the clarification.

As far as the thread switching goes, how do I get to a thread that’s not currently running? Almost certainly that’s my problem.

It is a little weird to see NVIDIA's semi-internal names for their chip architectures being used (ok, all the hardware review sites use these names too). Much like Intel uses names like “Conroe” or “Nehalem” to refer to a particular family of CPUs, NVIDIA uses these sorts of codes to refer to families of GPUs. Examples (a non-comprehensive list):

G80 = 8800 GTX, Tesla 870

G92 = 9800 GTX, …

GT200 = GTX 280

GF100 = GTX 480, Tesla C2050

GF110 = GTX 580

You can actually find these names listed in my favorite Wikipedia article:

I assume the upcoming Kepler architecture has an internal name like GK100 or something…

example: cuda block (2,0) thread (5,0,0)
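
Putting that together, a session would look something like this (myKernel and var_name are just placeholders for your own kernel and variable names):

(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) cuda block (2,0) thread (5,0,0)
(cuda-gdb) print var_name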

The cuda-gdb manual is your friend:

HTH

Yes, I’ve certainly done that and read through the manual a couple of times. My problem is that when I try to shift focus to a block beyond some threshold index, the debugger just sends me back to that index. It seems to default back to very specific block indices.

For example, when using 512 threads per block and some arbitrary number of blocks, the debugger puts me at block 14 if I try to access block 15 or greater. With 256 threads per block it won’t let me past block 29. I thought this might have something to do with the fact that the card has 15 streaming multiprocessors, since I’m getting hung up on multiples of 15 (the indices start at 0, so block 14 is the 15th block). That’s why I thought you were saying that I was missing some command that could tell the multiprocessors to process a new set of blocks or something.

Seemed like this guy was having a similar problem with the same card and version of cuda-gdb

Oops, I meant to reply to this issue as well. CUDA does not start all blocks at the beginning of the kernel, due to the way block scheduling is handled. Because blocks cannot be suspended or migrated, once they start they run to completion on the multiprocessor they were assigned to. The resource usage of your kernel (registers, shared memory, threads per block) determines how many blocks can run simultaneously on a multiprocessor, which, multiplied by the number of multiprocessors on your GPU, gives the maximum number of active blocks. The rest of the blocks in your kernel launch are held back until slots start to free up on the multiprocessors, at which point they are sent out for execution.

I have no experience with cuda-gdb, but it sounds like this might be related. The behavior you are seeing indicates that you are varying the number of total active blocks by changing the block size (and therefore the resource usage).
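
If you want a quick sanity check on those numbers, a trivial host program (my own sketch, nothing cuda-gdb reports) can print the SM count; the limits you describe (blocks 0-14 with 512 threads per block, 0-29 with 256) would be consistent with 1 and 2 resident blocks per SM on a 15-SM GTX 480:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        printf("failed to query device 0\n");
        return 1;
    }

    printf("multiprocessors: %d\n", prop.multiProcessorCount);  // 15 on a GTX 480

    // If only 1 block fits per SM at 512 threads/block (because of register or
    // shared-memory usage), at most 15 blocks are resident at once, so block
    // indices 0-14 are the only ones the debugger can focus on at that moment.
    return 0;
}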

Ok, that explains why I’m getting hung up on those particular numbers, and I get that CUDA can’t launch all the blocks simultaneously, but I should still be able to access an arbitrary block in cuda-gdb. If I request to switch focus to some block using the command ‘cuda block (i,j)’, there is nothing in the documentation to suggest that I am limited to only the first batch of blocks that are launched (which is the problem that I am encountering).

I figured that if I requested to follow a block that wasn’t currently being executed cuda-gdb would run the program until it got to that block and then give me back control. It would make sense that once I accessed a larger indexed block I wouldn’t be able to return to a smaller indexed block (since that block would have already executed), but not the other way around.

To access an arbitrary block, you can set a conditional breakpoint on the block of your choice and run to that breakpoint. It is otherwise not possible to schedule a block into the GPU on-demand.
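
On a toolkit that supports them, that would look something like this (myKernel is a placeholder, and I’m assuming the condition can reference the built-in blockIdx):

(cuda-gdb) break myKernel if blockIdx.x == 15
(cuda-gdb) run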

Thanks for the reply. I don’t think cuda-gdb v3.0 supports conditional breakpoints; they are not in the documentation for v3.0, but I do see them in the documentation for 4.0. The last responder in this post says the same thing: The Official NVIDIA Forums | NVIDIA. If that is the only way to access an arbitrary block then I guess I am stuck until I upgrade to 4.0.

Edit: Wow, I feel dumb. I just needed to set a breakpoint at the kernel and then keep telling the debugger to continue until my desired block became active. Hopefully my problems are now solved, sorry for my slowness and I appreciate the help.
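
In case it helps anyone else, the commands were roughly (myKernel and var_name are placeholders, and block (15,0) stands in for whichever block you are after):

(cuda-gdb) break myKernel
(cuda-gdb) run
(cuda-gdb) continue
(cuda-gdb) cuda block (15,0) thread (0,0,0)
(cuda-gdb) print var_name

with the continue repeated until the debugger reports that the block you want is active.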

Yes, I was going to reply back with this as an alternative solution, but I see you’ve figured that out yourself.