CUDA Toolkit 3.0 update GPU HW debugging tools to replace device emulation

tmurray · January 25, 2010, 2:27am

couldn’t you use cygwin and the windows pthreads library?

Gregory_Diamos · January 25, 2010, 12:53pm

Yes, but I assume that the reason people want to use windows is so that they can integrate it in with existing visual studio projects.

Sarnath · January 25, 2010, 2:34pm

While compiling in Cygwin, there is a way to tell “dont” develop dependencies on “Cygwin” DLLs… I forgot that option name…

May be, you can give it a try…

Lev · April 9, 2010, 7:26pm

This will make cuda programming very exclusive and will disallow cuda in educational process etc. Ordinal people will just turn to opencl which has cpu implementation or forget about gpu. For debugging speed is not so important. Usually new code just has a lot of small algorithmic errors misprints etc. You just need to check algorithm consistency first. GPU programming is tricky and making it more tricky will make it more exclusive. Is it a target to limit cuda developer crowd?

E.D_Riedijk · April 10, 2010, 6:30am

It is maybe just me, but I have never ever used the emulation mode to debug anything. Not because it is slow, just because there are several things that do NOT show up in emulation mode. And the things that do not show up are just those tricky things you mention.

seibert · April 10, 2010, 3:52pm

Same here. I’ve never solved a bug in my code with emulation mode. I would have said the only benefit to emulation mode was being able to use valgrind, but if cuda-memcheck does the same thing with the device, then I’m out of uses.

SPWorley · April 10, 2010, 9:57pm

Wow, I must be the only emulator fan!

I love the emulator for four reasons.

It compiles fast. I can fix a syntax error and compile my kernel and start running it all in 5 seconds. The GPU compile takes about 3 minutes. (Not a typo.) This sounds like a small whine, but in practice it make a big workflow difference. I am overclocking my i7 CPU to 4GHz only to speed up compile times. (nvcc isn’t multithreaded so raw MHz counts.)
I can printf() log anything I want very easily. Yes, there’s cuPrintf() for real GPUs but that has its own issues (and in fact I have my own runtime printf() similar to cuPrintf and it’s even more limited.)
I can add more complex instrumentation like asserts, stats like printing averages of quantities deep within the algorithm. You can’t do this on the GPU without the code potentially changing register (and shared/local) memory counts, which can affect not just speed but even launch configuration.
It’s much faster than cuda-GDB, by a factor of 14 on my code. I can’t sit around waiting for cudaGDB.

Is the emulator perfect? Not at all. But I use it a lot and am sad to see it go.

The other tool I always use is Ocelot. It doesn’t replace the emulator, but it does do a terrific job especially with memory access error detection.

Maybe Fermi and toolkit 3.1 or 4.0 will satiate me, though… rumor is there’s a printf (like?) support, and with the caches and register spilling it should be OK to add more code like asserts and stats building without breaking a kernel by using limited register resources.

E.D_Riedijk · April 11, 2010, 9:14am

Some questions & observations ;)

finding syntax errors must be faster than that, the slow part for me is in ptxas, but then nvcc is already done.
you are developing your algorithms on gpu? for me cuPrintf is for finding why my outcome is deviating from my MATLAB version. (okay, except for 2 programs where I cannot build a non-gpu version)
did you try cuda-gdb recently? I have never been able to use it, because of compilation time around the 2.1 days (normal compilation took already 30 minutes…) and never had a need afterwards, but a colleague told me that compilation for cuda-gdb has sped up greatly around 2.3, and also runtime was much better.
to your point 3, you can now add your own profiler counters. So that may help a lot for stats.

gonnet · April 11, 2010, 5:18pm

Hi Gregory,

You may perhaps be interested in the files contained in https://gforge.inria.fr/scm/viewvc.php/trun…32/?root=starpu . This is some wrapper for mingw written by some colleague, basically, you can still use your normal pthread code on top of windows without pain. This integrates quite well with autotools too … There are perhaps some features of pthread that are not implemented yet, but it took like a couple hours to port StarPU on windows with this header.

CÃ©dric

SPWorley · April 11, 2010, 10:17pm

You’re right that pure syntax errors are found just as fast with regular compiles… it is indeed ptxas that’s slow. What really kills my workflow is really the feedback I get from errors quickly found by actually running the kernel and the runtime reports an issue and aborts.

cuPrintf does work except it’s limited and awkward. Especially if there’s a memory corruption or crash, your printf buffer is lost too, so you don’t have any log just when that log is most useful! It also has a limited length per string and limited number of strings. Those can be boosted but they still start hitting limits. This isn’t because cuprintf is bad, it’s just because of the batch buffer design.

I should give cudaGDB a try again, but I have tried it in 2.3 and gave up in one run. I should test it more before making hasty decisions.

You’re also right about profiler counters… in my Monte carlo code I drop in emulator-only fancier statistics gathering, for example finding the mean and std of the number of times certain voxels are loaded, the min and max number of datapoints which can operate in the voxel, etc. So those are more algorithmic stats, not programming level frofile stats (which are also useful of course.)

MisterAnderson42 · April 12, 2010, 11:34am

Has anyone actually used cuda-memcheck sucessfully? I cannot find any documentation on it anywhere except for the command line options help printed when cuda-memcheck is run without any options. A quick test on my application produces an empty output file and nothing else.

What does it do?
What output should it produce?
What kinds of errors can it find?
What kinds of errors can it not find?

It sounds like a great tool, but without documented answers to these simple questions, it is useless.

jam11 · April 12, 2010, 12:19pm

Actually yes. Ther is a great tutorial on Dr Dobbs site http://www.drdobbs.com/high-performance-co…?queryText=cuda

on the subject. It is section 16 on GPU tutorial.
The output is the same as valgrind for C programs.

Lev · April 13, 2010, 12:14pm

Los Alamos… I remember it is a place where extraterrestrials from crashed in Roswell spaceship develop nuclear weapons and many other alien technologies. Of course, bugs in thier code are very sophisticated.

Jeroen · April 13, 2010, 9:07pm

I’ve used it a few times. Mostly for trying to locate out of bound memory reads/accesses which turn up because of a crashing program or faulty results. It shows if the access is read or write and the size of the data type. Pretty useful to narrow it down to a specific code line relatively fast.

plegresley · April 15, 2010, 9:26pm

So I have a question about cuda-memcheck. I was playing around with a simple kernel and generating out of bounds memory accesses to make sure I understood how it works and what it reports.

[codebox]#include <stdio.h>

global void myKernel(float *a, float *b, int n)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

b[idx] = 2.f*a[idx];

}

int main(int argc, char **argv)

{

float *a_d, *b_d;

int n;

dim3 grid, block;

cudaError_t stat;

n = 16;

cudaMalloc((void **)&a_d, n*sizeof(float));

cudaMalloc((void **)&b_d, n*sizeof(float));

grid.x = 1;

block.x = 256;

myKernel<<<grid, block>>>(a_d, b_d, n);

cudaThreadSynchronize();

stat = cudaGetLastError();

printf(“stat = %s\n”, cudaGetErrorString(stat));

return 0;

}

[/codebox]

I was expecting to see 240 invalid reads and writes but I only ever see one invalid read by one thread, and the thread number can change each time you run it. Sometimes it is 32, 128, 160, etc.

[codebox][legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 160 in block 0

========= Address 0x00101280 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 32 in block 0

========= Address 0x00101080 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[/codebox]

So it seems like what cuda-memcheck reports is the first, and only the first, invalid memory access of what may be many invalid memory accesses?

plegresley · April 15, 2010, 9:26pm

So I have a question about cuda-memcheck. I was playing around with a simple kernel and generating out of bounds memory accesses to make sure I understood how it works and what it reports.

[codebox]#include <stdio.h>

global void myKernel(float *a, float *b, int n)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

b[idx] = 2.f*a[idx];

}

int main(int argc, char **argv)

{

float *a_d, *b_d;

int n;

dim3 grid, block;

cudaError_t stat;

n = 16;

cudaMalloc((void **)&a_d, n*sizeof(float));

cudaMalloc((void **)&b_d, n*sizeof(float));

grid.x = 1;

block.x = 256;

myKernel<<<grid, block>>>(a_d, b_d, n);

cudaThreadSynchronize();

stat = cudaGetLastError();

printf(“stat = %s\n”, cudaGetErrorString(stat));

return 0;

}

[/codebox]

I was expecting to see 240 invalid reads and writes but I only ever see one invalid read by one thread, and the thread number can change each time you run it. Sometimes it is 32, 128, 160, etc.

[codebox][legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 160 in block 0

========= Address 0x00101280 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 32 in block 0

========= Address 0x00101080 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[/codebox]

So it seems like what cuda-memcheck reports is the first, and only the first, invalid memory access of what may be many invalid memory accesses?

MisterAnderson42 · April 15, 2010, 10:40pm

Yeah, I tried out cuda-memcheck now too. Really easy to setup and even works in Release builds. Introduced a simple out of bounds read in the middle of a huge app and it showed up right away.

There is still the open question of documentation on what it reports and what it doesn’t report, and it would be useful if it could return a given exit code on error (like valgrind) for automated testing scenarios. But all around a very useful and quick tool that will become my first step when debugging code that behaves strangely.

MisterAnderson42 · April 15, 2010, 10:40pm

Yeah, I tried out cuda-memcheck now too. Really easy to setup and even works in Release builds. Introduced a simple out of bounds read in the middle of a huge app and it showed up right away.

There is still the open question of documentation on what it reports and what it doesn’t report, and it would be useful if it could return a given exit code on error (like valgrind) for automated testing scenarios. But all around a very useful and quick tool that will become my first step when debugging code that behaves strangely.

E.D_Riedijk · April 16, 2010, 12:21pm

Well, it is just one line that has an error ;)

E.D_Riedijk · April 16, 2010, 12:21pm

Well, it is just one line that has an error ;)

Topic		Replies	Views
CUDA Toolkit 3.0 beta released now with public downloads CUDA Programming and Performance	104	430082	March 25, 2010
CUDA 2.1 beta CUDA Programming and Performance	49	67160	December 3, 2008
NVIDIA has hade a huge mistake with HW debugger Single-GPU debugging not supported and no emulation& CUDA Programming and Performance	34	6026	August 7, 2010
CUDA Toolkit 3.0 released CUDA Programming and Performance	62	26015	September 21, 2010
CUDA 2.1 discussion CUDA Programming and Performance	71	63937	February 17, 2009
CUDA Toolkit and SDK v2.2 released CUDA Programming and Performance	59	64618	January 25, 2011
CUDA Toolkit 3.2 release candidate available to registered developers CUDA Programming and Performance	68	63108	December 3, 2010
Is emulation mode removed from CUDA 3.0? CUDA Programming and Performance	23	22596	July 3, 2010
Cuda-gdb doesn't break and/or step into Kernels CUDA Programming and Performance	26	53597	August 1, 2011
Random execution times and freezes with concurent kernels - 2 CUDA Programming and Performance	5	2632	November 10, 2015

CUDA Toolkit 3.0 update GPU HW debugging tools to replace device emulation

Related topics