CUDA Toolkit 3.0 update: GPU HW debugging tools to replace device emulation

Device emulation support in the CUDA C Runtime will be deprecated as of the CUDA Toolkit 3.0 production release.

Now that more sophisticated hardware debugging tools are available and more are on the way, we will be focusing on supporting these tools instead of the legacy device emulation functionality.

On Linux, use cuda-gdb and cuda-memcheck. Third-party solutions from Allinea and TotalView will also be available soon.

On MacOS, use cuda-memcheck. We’re working on a cuda-gdb port to MacOS for a future release and will provide a preview to all GPU Computing Registered Developers as soon as it’s ready.

On Windows, use the new Visual Studio integrated debugging and profiling tools code-named “Nexus.” Please see for details.

Deprecating the device emulation feature in this release means no further development or bug fixes for this feature will be made after this release, and device emulation will be removed entirely from the CUDA Toolkit 3.1 release.

I admit I abuse emulation mode for a lot of algorithm instrumentation… sometimes I stream out extra statistics or progress traces in emulation mode only (things like snapshotting all pending rays to a file at a particular point in the computation.) I do this for debugging but also for things like visualizations. Emulation mode lets me stick these extra steps into the compute since I can call extra host-only functions as needed.

This will likely still be possible using the GPU but now it may need an extra layer, similar to cuPrintf(), and it will clearly not be as easy. Alternatively I need to start looking into the advanced features of Ocelot.
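For anyone wondering what such an extra layer might look like, here is a minimal sketch of the general idea: kernel code appends small records to a fixed-size buffer, and the host drains it after the kernel finishes. It is written as plain host-side C++ so it runs without a GPU. `TraceBuffer`, `TraceRecord`, `record` and `drain` are made-up names for illustration, and this is not a description of how cuPrintf() is actually implemented. On a real GPU the storage would presumably live in device memory, the slot counter would be bumped with a device atomic, and the buffer would be copied back to the host before reading.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical trace record: which thread logged it, plus one value.
struct TraceRecord {
    unsigned thread_id;
    float value;
};

// Fixed-capacity, append-only buffer (host-side model of the pattern).
class TraceBuffer {
public:
    explicit TraceBuffer(std::size_t capacity) : slots_(capacity), next_(0) {
        assert(capacity > 0);
    }

    // Called from "kernel" code. Each caller claims a unique slot by
    // atomically bumping the counter. Returns false once the buffer is full.
    bool record(unsigned thread_id, float value) {
        const std::size_t slot = next_.fetch_add(1);
        if (slot >= slots_.size()) return false;  // out of room, record dropped
        slots_[slot] = TraceRecord{thread_id, value};
        return true;
    }

    // Called by the host once no more records are being written.
    // Returns the stored records and resets the buffer for reuse.
    std::vector<TraceRecord> drain() {
        const std::size_t used = std::min(next_.load(), slots_.size());
        std::vector<TraceRecord> out(
            slots_.begin(),
            slots_.begin() + static_cast<std::ptrdiff_t>(used));
        next_.store(0);
        return out;
    }

private:
    std::vector<TraceRecord> slots_;
    std::atomic<std::size_t> next_;
};
```

The host would then write the drained records to a file or feed them to a visualization, which is the part emulation mode let you do directly from inside the kernel.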

Will Fermi be supported with the 3.0 toolkit (and therefore have legacy emulation ability) or will Fermi never be emulatable?

Emulation is its own target, so there is no distinction between Tesla and Fermi. But 3.0 should provide emulation support for all the features that Fermi brings.

I meant the new Fermi-specific extensions that are not in the current 3.0 beta toolkit, things like setting the shared memory mode, launching multiple kernels per GPU, specifying on-chip atomic globals, etc.

Device emulation has nothing to do with any of that. Basically, “device emulation” is a misnomer–it never emulated a G80 or any other chip. Instead, it compiled for the CPU using the most obvious ways possible (this is why it was so slow). As a result, hardware-specific extensions are completely separate from device emulation.
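To make "compiled for the CPU" concrete, here is a deliberately naive sketch of running a kernel-style function on the host by visiting every block and thread index in turn. It is an illustration only, not a description of what nvcc's device emulation actually generated: `Dim`, `scale_kernel`, `launch_serial` and `scale_on_cpu` are hypothetical names, and a plain serial loop like this could not handle kernels that use barriers such as __syncthreads().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types (x dimension only).
struct Dim {
    unsigned x;
};

// A "kernel" written as an ordinary C++ function. On a GPU, blockIdx,
// blockDim and threadIdx are built-ins; here they are plain parameters.
void scale_kernel(Dim blockIdx, Dim blockDim, Dim threadIdx,
                  float* data, std::size_t n, float factor) {
    const std::size_t i =
        static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // same bounds check a real kernel would need
}

// Naive "launch": run every thread of every block one after another.
void launch_serial(unsigned gridDim, unsigned blockDim,
                   float* data, std::size_t n, float factor) {
    assert(blockDim > 0);
    for (unsigned b = 0; b < gridDim; ++b)
        for (unsigned t = 0; t < blockDim; ++t)
            scale_kernel(Dim{b}, Dim{blockDim}, Dim{t}, data, n, factor);
}

// Convenience wrapper: picks a grid size that covers the whole vector.
std::vector<float> scale_on_cpu(std::vector<float> v, float factor) {
    const unsigned threads = 4;
    const unsigned blocks =
        static_cast<unsigned>((v.size() + threads - 1) / threads);
    launch_serial(blocks, threads, v.data(), v.size(), factor);
    return v;
}
```

Because everything here is ordinary host code, the kernel body can call any host function, which is why nothing in this picture depends on what a particular chip supports.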

Not that I am using device emulation, so I don’t care at all ;) But using Nexus means using versions of Windows that are known to have worse CUDA performance than XP…

I want Nexus for Linux :( … or else please let deviceemu stay… I use it a lot when I am debugging my algorithm for math errors… so I will miss it. :(

Why not use gdb with DDD as a graphical frontend?

As E.D. Riedijk noted, does this mean that Windows developers are forced to use >=Windows Vista, >=Visual Studio 2008 and >=2 GPUs just to be able to debug their code? This can be quite a limitation, especially the 2 GPU part. Even if you have two computers each running a GPU, you are blocking both computers during debugging.
Even a ported cuda-gdb would be helpful for Windows developers. Or developers could group together and quickly port gpuocelot :D.
Also a quick question about the developer page. I have created an account for Nexus through the developer page, but I can’t use my account there. Should I register again to gain access to those resources, or is there a simpler way?

I don’t see Visual Studio 2008 as limitation and debugging on the real hardware is quite important, since errors might only occur running your code on the gpu and not in emulation mode. At the moment I have such a problem and I don’t know how to locate the line of code that causes this behavior.

It’s very probably impossible to run your graphics output and the debugger on one GPU. To inspect the memory on the GPU, the debugger has to stop execution on the GPU, and then what happens to your graphics output? So you need 2 GPUs.

Windows Vista / Windows 7 might be a limitation. NVIDIA doesn’t want to do the work twice and support two kinds of driver models for debugging, and I don’t know if it’s even possible to do that with the XP driver model. The overhead for starting a kernel is at the moment higher on Vista / Windows 7, but that overhead might decrease in the future. You can also develop your CUDA applications on Vista and run your code later on Windows XP.

The hardware and software requirements are higher, but IMHO it’s worth it. Hardware debugging was at the top of my own feature request list.

As an alternative, for an X-application (in Linux!) the second card could also be the one on your laptop (if you have one) - in which case just about any old piece of techno-trash will do.

Is there any chance that the emu code will be open-sourced? Once it’s deprecated, of course…

Great point, +1 for that question. (I would love it… :teehee:)

Hmm, thanks very much :) … I didn’t know much about DDD… (I am a non-CS student) … looks nice… how complex is it to use?

I understand the 2 GPU constraint, but currently under Linux you can use a second GPU in your system. For Nexus you need another box. And the problem with the new OS/compiler is mainly convenience. I would like to maintain the same system as my team, and they will surely not upgrade just for me.
I know it is required, but fewer limitations would help a lot.

Well, actually you don’t need a second box for Nexus. The main idea is that you can’t debug on the same GPU that you are using for display, which makes sense. So you either use a GPU from another box or a second one in the same box. The display GPU can be a simple one; if it’s in the same box it needs to be an NVIDIA one. I do hope that the emu will continue in some form, so that you can always do some work even on a box without an NVIDIA GPU. As for upgrading, we actually did just that: we moved the whole team from VS 05 to VS 08 because of Nexus. But considering that VS 10 is in advanced stages of development, well, it’s always harder the bigger the gap.

No–if you want something like that, just use Ocelot. It’s light years beyond device emulation anyway.

Thanks for the vote of confidence Tim :)

Unfortunately, at the moment it’s only available for Linux :-(.

Is it hard to port it to Windows, maybe because of dependencies?

It would be difficult but not impossible to port to Windows. All of the major dependencies (LLVM, boost) have Windows support. The main difficulties would be in wrapping the interface to pthreads and Linux timers and changing the build system to use something other than autotools. I think that it could be done by one person in a few weeks. Unfortunately, no one in my lab even has Windows installed, so finding someone to actually do it would be the biggest problem.