Is emulation mode removed from CUDA 3.0?

I used CUDA 2 before, and emulation mode worked great with Visual Studio. I have a new PC now; I just installed the newest SDK and can’t get emulation to work. I set the options -deviceemu -D_DEVICEEMU, but Visual Studio tells me it can’t find the sources when I try to step into the kernel. I used the --keep flag to keep the intermediate files, but then Visual Studio jumps into some generated files that are meaningless to me instead of the kernel.

What might I be doing wrong? Did something change from the previous version?

In the file CUDA_Release_Notes_3.1.txt I found:

New Toolkit Features

o Device emulation has been removed.

Does it mean that there is no emulation now? How do I debug then?

There’s cuda-gdb, which you can use for hardware-level debugging. Although personally I’ve found it close to unusable because of all the limitations and conditions your kernel has to meet in order to be debugged by it, it is currently the best option available.

Alternatively there’s Ocelot, which is something akin to a third-party emulation mode for CUDA, although its objectives go far beyond being just an emulation mode. As proof of that, its debugging capabilities are sorely lacking at the moment.

Long story short, pragmatically speaking we are stuck with printf and cudaMemcpy-ing memory states at each step. :P
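The cudaMemcpy approach mentioned above amounts to copying device buffers back to the host after each step you want to inspect and printing them there. A minimal sketch of the idea, assuming the usual runtime API (the kernel and buffer names here are made up for illustration):

```cuda
// Hypothetical kernel whose intermediate state we want to inspect.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 8;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev = 0;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<1, 32>>>(dev, n, 2.0f);

    // "Debugging": copy the device state back after the step of interest
    // and inspect it on the host.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("host[%d] = %f\n", i, host[i]);

    cudaFree(dev);
    return 0;
}
```

Crude, but it works on any card and any OS, which is more than can be said for the debuggers discussed here.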

Both cuda-gdb and Ocelot are for Linux, while I am on Windows.

So you confirm that emulation was removed from CUDA? printf can be called in a kernel only on Fermi :(
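For reference, on Fermi (compute capability 2.0 and later) printf can be called straight from device code when the file is compiled with -arch=sm_20 or higher. A hedged sketch of what that looks like (the kernel name is my own):

```cuda
// Device-side printf, available only on compute capability 2.0+ (Fermi).
// Compile with: nvcc -arch=sm_20 ...
#include <cstdio>

__global__ void debugKernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        printf("thread %d: data[%d] = %f\n", i, i, data[i]);
}
```

On pre-Fermi cards this will not compile for the device target, which is exactly the complaint above.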

There’s also Nsight.

Is memcpy and then dumping that to a file really that bad? If you use Thrust, it’s basically a two-liner. I often find it very useful, since you can see the entire state of your system at once. The key to using this debugging technique is to use the smallest possible dataset that reproduces the problem. I find debuggers very annoying, especially when you need to step through a loop - it’s easier to just print everything out or dump it to a file, and then you can see exactly which iteration things went bad.
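The Thrust "two-liner" above is presumably something along these lines: copy the device vector back to the host, then stream it to a file. A sketch under that assumption (the helper name is made up):

```cuda
// Dump a Thrust device vector to a text file, one value per line.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <fstream>
#include <iterator>

void dumpDeviceVector(const thrust::device_vector<float> &d_data,
                      const char *filename)
{
    // Line 1: bring the whole device state back to the host.
    thrust::host_vector<float> h_data = d_data;
    // Line 2: stream it out so the full system state can be inspected.
    std::ofstream out(filename);
    thrust::copy(h_data.begin(), h_data.end(),
                 std::ostream_iterator<float>(out, "\n"));
}
```

With a small enough dataset, diffing two such dumps between iterations pinpoints exactly where things go bad.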

Which requires two GPUs to debug. No, thanks.

Porting Ocelot to Windows would really be a simple task. I think it could be done in a day by someone who knew the codebase, and in about a week by someone who didn’t. If anyone has some free time (I don’t) and feels that this has value, I would be very grateful if they could try it out.

As an incentive, I’ll send anyone who can get the SDK regression tests in Ocelot to pass on Windows XP or Longhorn an 8800 GTS that I have lying around, or buy them dinner at GTC if they are going.

I’ll add beers or dinner to that as well. (I can’t really offer hardware, you don’t want my preproduction monster boards)

haha, I want one of those, so I can get my microbenchmarks out faster than you release :P

Not only does Nsight require two GPUs, but it is a poor debugger at the moment. (Maybe later on, who knows when, it might be fine.) Some of the problems I have are:

  • You cannot debug both host and device code in the same debugging session. (You can always debug one and then the other, but it is inconvenient.)

  • It does not display the point of an illegal memory exception. For example, if you pass a bad pointer to the kernel and try to dereference it, the debugger will not display the line where the exception occurs.

  • I’m not exactly sure what kinds of memory problems it is supposed to check for, but it doesn’t catch buffer overflows/underflows, or un-freed cudaMalloc memory.

  • Conditional breakpoints are very limited and do not share the same syntax as the MS debugger (almost not useful). If you have conditional breakpoints, you will get useless warning messages when debugging either host or device code.

I think it is a REALLY BAD DECISION that emulation mode is being removed without a good debugging alternative in place. With emulation mode, you could use the existing MS debugger and single-step through kernel code (debug host and device code in the same session, and step through the grid in a predictable way), use conditional breakpoints like “j == 1”, display variables just as in Nsight, etc. In addition, you could develop without an NVIDIA card, and use printf in kernel code. The alternative presented to us is really a step backwards: a debugger (Nsight) with fewer features, where printf cannot be used, where you have to purchase multiple NVIDIA cards, etc.

Looks like this issue has been visited before (http://forums.nvidia.com/index.php?showtopic=170001&pid=1064752&mode=threaded&start=#entry1064752), and nothing will come of it. Ocelot looks like it might be an alternative, but I’m not sure when it will be ported to Windows (I’m in no mood to do it myself), and the documentation (http://code.google.com/p/gpuocelot/w/list) looks abysmal, just like a student project. There are no pre-built binaries; see http://code.google.com/p/gpuocelot/wiki/Installation. The “installation” requires you to actually build the damn thing. Wonderful.

I said it somewhere in another thread: this is the price for cutting-edge technology, orders of magnitude faster than existing solutions. If you can’t work like this, it’s not for you.

I disagree. I personally think that the differences in the execution models between an actual GPU and the previous emulation mode ended up masking bugs. Sure, you could step through an application in a debugger, but the actual GPU would be doing something completely different. The point of a debugger, in my opinion, is to give you insight into what is actually going on in the machine. Additionally, the restriction on thread count made debugging any real application infeasible.

An ideal solution would be to add a gdb or similar interface to an emulator like ocelot or a simulator like gpgpu-sim. This way you could debug on a CPU or GPU, and you could step through a program from the perspective of a warp such that the state that is visible in the debugger would be modified in the same way that it would on real hardware.

It really is just a student project. There are currently two students actively working on it, and we have absolutely no financial motivation to write good documentation. My only motivation to write any documentation or provide any support for Ocelot is to increase the usability of GPUs, which may make it more likely for people to use CUDA and GPUs over something else, and possibly make it more likely that there will be enough funding and interest for me to continue working on a topic that I enjoy and believe is a promising solution for multi-core and parallel programming. When I have a choice between writing more documentation or porting code to other platforms, I have to weigh that against things like writing my thesis, performing experiments, or enhancing the core of Ocelot to prototype new features in CUDA.

We don’t include pre-built binaries because that really doesn’t make sense on Linux. Binaries depend on library versions, which vary dramatically across different versions of Linux (even versions of libstdc++ vary), so binaries are typically tied to a particular release of a particular distribution.

There is one upside, though. Ocelot is completely in the public domain at this point. The license permits unrestricted use, including for-profit use, with only a limit on our liability. If some company really wants a better debugging interface for CUDA, or a Windows port, all they would have to do is hire someone to write a gdb interface, write some documentation, install it on Windows, and release a binary.

We are going to put in a bid for a couple hundred thousand in the fall to get the NSF to fund it for us, but the turnaround time on that will be about a year, if they agree to it at all.