CUDA Toolkit 3.0 update GPU HW debugging tools to replace device emulation

couldn’t you use cygwin and the windows pthreads library?

Yes, but I assume that the reason people want to use windows is so that they can integrate it in with existing visual studio projects.

While compiling in Cygwin, there is a way to tell “dont” develop dependencies on “Cygwin” DLLs… I forgot that option name…

May be, you can give it a try…

This will make cuda programming very exclusive and will disallow cuda in educational process etc. Ordinal people will just turn to opencl which has cpu implementation or forget about gpu. For debugging speed is not so important. Usually new code just has a lot of small algorithmic errors misprints etc. You just need to check algorithm consistency first. GPU programming is tricky and making it more tricky will make it more exclusive. Is it a target to limit cuda developer crowd?

It is maybe just me, but I have never ever used the emulation mode to debug anything. Not because it is slow, just because there are several things that do NOT show up in emulation mode. And the things that do not show up are just those tricky things you mention.

Same here. I’ve never solved a bug in my code with emulation mode. I would have said the only benefit to emulation mode was being able to use valgrind, but if cuda-memcheck does the same thing with the device, then I’m out of uses.

Wow, I must be the only emulator fan!

I love the emulator for four reasons.

  1. It compiles fast. I can fix a syntax error and compile my kernel and start running it all in 5 seconds. The GPU compile takes about 3 minutes. (Not a typo.) This sounds like a small whine, but in practice it make a big workflow difference. I am overclocking my i7 CPU to 4GHz only to speed up compile times. (nvcc isn’t multithreaded so raw MHz counts.)

  2. I can printf() log anything I want very easily. Yes, there’s cuPrintf() for real GPUs but that has its own issues (and in fact I have my own runtime printf() similar to cuPrintf and it’s even more limited.)

  3. I can add more complex instrumentation like asserts, stats like printing averages of quantities deep within the algorithm. You can’t do this on the GPU without the code potentially changing register (and shared/local) memory counts, which can affect not just speed but even launch configuration.

  4. It’s much faster than cuda-GDB, by a factor of 14 on my code. I can’t sit around waiting for cudaGDB.

Is the emulator perfect? Not at all. But I use it a lot and am sad to see it go.

The other tool I always use is Ocelot. It doesn’t replace the emulator, but it does do a terrific job especially with memory access error detection.

Maybe Fermi and toolkit 3.1 or 4.0 will satiate me, though… rumor is there’s a printf (like?) support, and with the caches and register spilling it should be OK to add more code like asserts and stats building without breaking a kernel by using limited register resources.

Some questions & observations ;)

  • finding syntax errors must be faster than that, the slow part for me is in ptxas, but then nvcc is already done.
  • you are developing your algorithms on gpu? for me cuPrintf is for finding why my outcome is deviating from my MATLAB version. (okay, except for 2 programs where I cannot build a non-gpu version)
  • did you try cuda-gdb recently? I have never been able to use it, because of compilation time around the 2.1 days (normal compilation took already 30 minutes…) and never had a need afterwards, but a colleague told me that compilation for cuda-gdb has sped up greatly around 2.3, and also runtime was much better.
  • to your point 3, you can now add your own profiler counters. So that may help a lot for stats.

Hi Gregory,

You may perhaps be interested in the files contained in https://gforge.inria.fr/scm/viewvc.php/trun…32/?root=starpu . This is some wrapper for mingw written by some colleague, basically, you can still use your normal pthread code on top of windows without pain. This integrates quite well with autotools too … There are perhaps some features of pthread that are not implemented yet, but it took like a couple hours to port StarPU on windows with this header.

Cédric

You’re right that pure syntax errors are found just as fast with regular compiles… it is indeed ptxas that’s slow. What really kills my workflow is really the feedback I get from errors quickly found by actually running the kernel and the runtime reports an issue and aborts.

cuPrintf does work except it’s limited and awkward. Especially if there’s a memory corruption or crash, your printf buffer is lost too, so you don’t have any log just when that log is most useful! It also has a limited length per string and limited number of strings. Those can be boosted but they still start hitting limits. This isn’t because cuprintf is bad, it’s just because of the batch buffer design.

I should give cudaGDB a try again, but I have tried it in 2.3 and gave up in one run. I should test it more before making hasty decisions.

You’re also right about profiler counters… in my Monte carlo code I drop in emulator-only fancier statistics gathering, for example finding the mean and std of the number of times certain voxels are loaded, the min and max number of datapoints which can operate in the voxel, etc. So those are more algorithmic stats, not programming level frofile stats (which are also useful of course.)

Has anyone actually used cuda-memcheck sucessfully? I cannot find any documentation on it anywhere except for the command line options help printed when cuda-memcheck is run without any options. A quick test on my application produces an empty output file and nothing else.

What does it do?
What output should it produce?
What kinds of errors can it find?
What kinds of errors can it not find?

It sounds like a great tool, but without documented answers to these simple questions, it is useless.

Actually yes. Ther is a great tutorial on Dr Dobbs site http://www.drdobbs.com/high-performance-co…?queryText=cuda

on the subject. It is section 16 on GPU tutorial.
The output is the same as valgrind for C programs.

Los Alamos… I remember it is a place where extraterrestrials from crashed in Roswell spaceship develop nuclear weapons and many other alien technologies. Of course, bugs in thier code are very sophisticated.

I’ve used it a few times. Mostly for trying to locate out of bound memory reads/accesses which turn up because of a crashing program or faulty results. It shows if the access is read or write and the size of the data type. Pretty useful to narrow it down to a specific code line relatively fast.

So I have a question about cuda-memcheck. I was playing around with a simple kernel and generating out of bounds memory accesses to make sure I understood how it works and what it reports.

[codebox]#include <stdio.h>

global void myKernel(float *a, float *b, int n)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

b[idx] = 2.f*a[idx];

}

int main(int argc, char **argv)

{

float *a_d, *b_d;

int n;

dim3 grid, block;

cudaError_t stat;

n = 16;

cudaMalloc((void **)&a_d, n*sizeof(float));

cudaMalloc((void **)&b_d, n*sizeof(float));

grid.x = 1;

block.x = 256;

myKernel<<<grid, block>>>(a_d, b_d, n);

cudaThreadSynchronize();

stat = cudaGetLastError();

printf(“stat = %s\n”, cudaGetErrorString(stat));

return 0;

}

[/codebox]

I was expecting to see 240 invalid reads and writes but I only ever see one invalid read by one thread, and the thread number can change each time you run it. Sometimes it is 32, 128, 160, etc.

[codebox][legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 160 in block 0

========= Address 0x00101280 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 32 in block 0

========= Address 0x00101080 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[/codebox]

So it seems like what cuda-memcheck reports is the first, and only the first, invalid memory access of what may be many invalid memory accesses?

So I have a question about cuda-memcheck. I was playing around with a simple kernel and generating out of bounds memory accesses to make sure I understood how it works and what it reports.

[codebox]#include <stdio.h>

global void myKernel(float *a, float *b, int n)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

b[idx] = 2.f*a[idx];

}

int main(int argc, char **argv)

{

float *a_d, *b_d;

int n;

dim3 grid, block;

cudaError_t stat;

n = 16;

cudaMalloc((void **)&a_d, n*sizeof(float));

cudaMalloc((void **)&b_d, n*sizeof(float));

grid.x = 1;

block.x = 256;

myKernel<<<grid, block>>>(a_d, b_d, n);

cudaThreadSynchronize();

stat = cudaGetLastError();

printf(“stat = %s\n”, cudaGetErrorString(stat));

return 0;

}

[/codebox]

I was expecting to see 240 invalid reads and writes but I only ever see one invalid read by one thread, and the thread number can change each time you run it. Sometimes it is 32, 128, 160, etc.

[codebox][legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 160 in block 0

========= Address 0x00101280 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[legrespa@gpgpu ~]$ cuda-memcheck ./a.out

========= CUDA-MEMCHECK

stat = no error

========= Invalid read of size 4

========= at 0x00000028 in myKernel

========= by thread 32 in block 0

========= Address 0x00101080 is out of bounds

=========

========= ERROR SUMMARY: 1 errors

[/codebox]

So it seems like what cuda-memcheck reports is the first, and only the first, invalid memory access of what may be many invalid memory accesses?

Yeah, I tried out cuda-memcheck now too. Really easy to setup and even works in Release builds. Introduced a simple out of bounds read in the middle of a huge app and it showed up right away.

There is still the open question of documentation on what it reports and what it doesn’t report, and it would be useful if it could return a given exit code on error (like valgrind) for automated testing scenarios. But all around a very useful and quick tool that will become my first step when debugging code that behaves strangely.

Yeah, I tried out cuda-memcheck now too. Really easy to setup and even works in Release builds. Introduced a simple out of bounds read in the middle of a huge app and it showed up right away.

There is still the open question of documentation on what it reports and what it doesn’t report, and it would be useful if it could return a given exit code on error (like valgrind) for automated testing scenarios. But all around a very useful and quick tool that will become my first step when debugging code that behaves strangely.

Well, it is just one line that has an error ;)

Well, it is just one line that has an error ;)