Is emulation mode removed from CUDA 3.0?

This is very true. I can tell you from years of reading the CUDA forum (especially in the early days before cuda-gdb existed) that there were dozens and dozens of posts that started “This code works in the emulator, but fails on the device.” Almost all of those problems were caused by:

  • Passing host pointers to the device, which works in emulation because everything there runs in host memory.

  • Race conditions hidden by emulation, which serializes threads much more than the device.

  • Immediate launch failures due to more fundamental driver issues that were missed because the poster had no error-checking code at all.
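For anyone who lands here with the third problem, a minimal sketch of the kind of error checking that catches launch failures might look like the following. The `CUDA_CHECK` macro and the `scale` kernel are hypothetical names made up for illustration. Note that `cudaDeviceSynchronize` only exists from CUDA 4.0 on; with older toolkits the equivalent call was `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line and the runtime's message
// whenever a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply each element by a.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1024;
    float h_x[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_x, h_x, n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Pass the device pointer d_x here, not the host array h_x.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors during execution

    CUDA_CHECK(cudaMemcpy(h_x, d_x, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));

    printf("h_x[0] = %f\n", h_x[0]);
    return 0;
}
```

Kernel launches return nothing themselves, which is why the check after the launch goes through `cudaGetLastError` and a synchronize call.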

Ocelot would certainly catch the first two problems. Forcing people to check errors is beyond the scope of any program, I think. :)

To be honest, the shortcomings of emulation mode and the utility of Ocelot really point to an obvious solution for NVIDIA:

They should hire some people (and/or try to tempt you with a fat paycheck after graduating) to turn Ocelot into a supported part of the toolkit. It detects a whole host of errors automatically, like a GPU valgrind of sorts, and lets people run CUDA code in a realistic manner without an actual CUDA device.

Seriously, Ocelot should be in the toolkit. Given NVIDIA’s push to grow CUDA with Fermi, I don’t know why it hasn’t already happened. tmurray: Please pester your bosses with this. Hire Gregory and solve this problem for CUDA 4.0. :)

Better yet: make the toolkit completely open source, and people will solve their problems themselves.

Yeah, I know, it was discussed dozens of times already.

Ocelot is much more advanced than the NVIDIA emulation mode (of 2.x) anyway, so no need to miss it :) If someone wanted to implement host-side debugging, that’s the place to look. As it is a full emulator, even valgrind-like checking could be implemented.

But I also agree with NVIDIA that on-device debugging is a good idea, as you can see it working on the actual hardware, and find problems that only occur on the hardware (such as weird race conditions) that you’d never find in an emulator. cuda-gdb is the best option as soon as the tool matures IMO.
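As an illustration of the kind of race that tends to show up only on hardware, here is a minimal sketch of a kernel that depends on a barrier between writing and reading shared memory. The `reverseBlock` kernel and the size `N` are hypothetical, chosen only for this example. The code below is the correct version; deleting the `__syncthreads()` line introduces the race, and a run that serializes threads more heavily than the device does may still appear to work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // one block of N threads, i.e. several warps

// Reverse N ints in place using a shared-memory staging buffer.
__global__ void reverseBlock(int *data)
{
    __shared__ int tile[N];
    int i = threadIdx.x;

    tile[i] = data[i];

    // Barrier: every thread must finish writing tile[] before any thread
    // reads an element written by a different thread. Without it, a thread
    // may read tile[N - 1 - i] before the owning warp has stored it.
    __syncthreads();

    data[i] = tile[N - 1 - i];
}

int main()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = NULL;
    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return 1;
    if (cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    reverseBlock<<<1, N>>>(d);
    if (cudaGetLastError() != cudaSuccess) return 1;

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d);

    // Count elements that are not in their reversed position.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) ++mismatches;

    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```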

Did you know that Nsight works on only 4 streaming multiprocessors? If you set a breakpoint and the program hits it on a different streaming multiprocessor, Nsight will not stop.