NVIDIA has made a huge mistake with the HW debugger: single-GPU debugging not supported and no emulation

I develop on a laptop. Dual GPUs are REQUIRED to use HW debugging, and emulation mode is no longer available!!! You can’t be serious, NVIDIA. Very amateurish. Looks like I get to use printf’s; at least they threw me that bone, not that it works. It doesn’t.

Stick with CUDA 2.3 for the time being. Is there anything in the CUDA 3.x API that you really need? Fermis are pretty much unobtainium in laptops, I think.

Would your laptop be equipped with Optimus technology, i.e. Intel integrated graphics in parallel with an nVidia chip?

cuPrintf does actually work (I use it all the time). printf from the device needs fermi hardware.

And the fact that two GPUs are required for hardware debugging is not very strange is it? It is not necessary that they are both in the same machine however. NVIDIA might argue that emulation mode was amateurish and that professional people have access to at least two machines for debugging purposes…

If emulation mode correctly identified the vast majority of problems that people encountered when writing kernels, it would not have been removed. But, it didn’t, and it was ultimately incredibly misleading as a tool for developers.

If you really need to run on the CPU, Ocelot is probably the way to go.

I must admit that I miss emulation mode a lot too…

Ocelot is for unix/linux - I mostly develop on windows.

Nsight is too cumbersome: you need 2 GPUs, you have to set it up, and once you’re done you have an idle computer just for debugging.

Even for a company like the one where I work, this is a lot of money sitting idle most of the time (I know it can be used for other stuff when I’m not debugging, but if someone takes it for another task it won’t return to being a debugging machine), plus only one person can debug at a time.

I just miss putting a breakpoint in Visual Studio, playing with the blockIdx/threadIdx values and stepping through in a visual debugger to see why my index calculations or input data are off by one sample…

I understand that emulation mode took too much time to maintain and that it didn’t solve out-of-bounds memory and race condition issues, but now to find an index calculation error I need Nsight or Ocelot, and neither of them is that easy.

What about a light-emulation version, with minimal support from nVidia, that would not take so many resources from nVidia but would give developers an easy and fast way to debug such issues???

please, please, pretty please… :)

Well anyway just my one cent…
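
For what it’s worth, the kind of "light emulation" meant here can be roughed out by hand for simple kernels. The sketch below is plain C++, not CUDA: the `Dim` struct, `shift_kernel_body` and `emulate_launch` are made-up names for illustration, and it only mimics a 1D launch by looping over every (block, thread) pair on the host. It will not reveal races or anything else that depends on threads actually running in parallel, but it does let you set a breakpoint on the index calculation in an ordinary debugger.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types (1D only).
struct Dim { unsigned x; };

// Kernel body written as an ordinary function so a host debugger can
// break inside it. Each "thread" copies one sample shifted by `offset`.
void shift_kernel_body(Dim blockIdx, Dim blockDim, Dim threadIdx,
                       const std::vector<float>& in, std::vector<float>& out,
                       std::size_t offset) {
    // Same global index formula a 1D CUDA kernel would use.
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i + offset >= in.size()) return;  // guard against reading past the input
    assert(i < out.size());               // catch an undersized output buffer
    out[i] = in[i + offset];
}

// Serial "launch": visits every (block, thread) pair one at a time.
void emulate_launch(unsigned gridDimX, unsigned blockDimX,
                    const std::vector<float>& in, std::vector<float>& out,
                    std::size_t offset) {
    for (unsigned b = 0; b < gridDimX; ++b)
        for (unsigned t = 0; t < blockDimX; ++t)
            shift_kernel_body(Dim{b}, Dim{blockDimX}, Dim{t}, in, out, offset);
}
```

Calling `emulate_launch(2, 4, in, out, 1)` on an 8-element input walks the same indices a 2-block, 4-thread launch would produce, so an off-by-one in the index or the guard shows up when you step through or inspect `out`.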


Btw, hardware debugging works only on 4 streaming multiprocessors of 14 as far as I know.

You keep saying that, but was the question ever asked to actual developers? Emulation mode was immensely useful for finding your everyday bug or typo in a given expression. Not every debugging session is about something inherently parallel; sometimes you just need to get the code working. Or you can look at code for hours while assuming that a given expression actually has the right values for its operands, while a quick look with emudebug and a hover over the variable shows you that it has a wrong value to begin with.

A client of mine uses an IDL program to call a cuda library. There is no way nsight can debug that. He owns GF100 hardware. I had to tell him he was pretty much screwed and to stick with cuda 2.3 where he can put a breakpoint in the device code and attach to process.

You also said that keeping emulation mode “took a lot of time and slowed the overall cuda progress”. Killing something because it’s “hard” doesn’t exactly seem like a good idea to me. I don’t think I’ve ever read anyone say that emudebug was useless since it couldn’t deal with deadlocks or race conditions. In fact, when your code works in emudebug and doesn’t in debug mode, you can pretty much assume it’s one of the two and actually start looking for exactly that.

Unless nvidia does the work itself or pays someone to port ocelot to Windows, it is not a viable suggestion.

Personally, once cuPrintf was around, I didn’t miss emulation at all. And I did get bitten by the “no race in emulation, but race in parallel” problem - not to mention some things carefully optimised on the assumption that warps processed simultaneously going awry.
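
To make the “no race in emulation, but race in parallel” point concrete, here is a small host-only C++ sketch (not CUDA, and the function names are invented for this example). Each "thread" does an in-place `data[i] = data[i + 1]` with no barrier between the read and its neighbour's write. A serial emulator that always runs threads in ascending order gets a clean left shift every time, while a different order, standing in here for whatever schedule real hardware might pick, gives a different answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of one "thread": reads its right-hand neighbour and overwrites itself.
// With no synchronisation, the result depends on the order threads run in.
void shift_left_body(std::vector<int>& data, std::size_t i) {
    assert(i < data.size());
    if (i + 1 < data.size()) data[i] = data[i + 1];
}

// Threads run one after another in ascending order, as a naive serial
// emulator might do. Every read happens before the neighbour is overwritten.
std::vector<int> run_ascending(std::vector<int> data) {
    for (std::size_t i = 0; i < data.size(); ++i) shift_left_body(data, i);
    return data;
}

// Same threads in descending order: each read now sees an already
// overwritten neighbour, so the last value propagates through the array.
std::vector<int> run_descending(std::vector<int> data) {
    for (std::size_t i = data.size(); i-- > 0;) shift_left_body(data, i);
    return data;
}
```

With the input `{1, 2, 3, 4}`, `run_ascending` returns `{2, 3, 4, 4}` and `run_descending` returns `{4, 4, 4, 4}`. Code that only ever ran under the first schedule would look correct until it hit a device that orders things differently.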

If Ocelot was available for windows as a separate set of libraries that you would link against in ‘debug’ configurations, would you actually use it? If not, what additional capability would change your mind?

I think that as long as it were plug-n-play (as simple as adding -deviceemu to the command line), with clicking the mouse to toggle breakpoints on/off and see the variables’ values, I’d be extremely happy.

Like Ailleur said, most of the bugs I encounter are “stupid” indexing stuff and not race-condition et al.

I have a lot of experience in unix, but installing Ocelot will never be as easy as just changing a drop-down list to debug, clicking the mouse to toggle a breakpoint and clicking run - unless you indeed write a plugin for Ocelot ;)

Currently the other options are either to use cuPrintf or dump data from device to data files… it’s so 60’s ;)


So what you really want is an interface in visual studio so that you can step into kernels, set breakpoints visually in an IDE, and inspect the values of variables.

That is within the realm of things that are possible to do. However, if it would take me a few days to build ocelot on windows (probably a week since I would actually have to find/buy/setup a windows machine), doing that would be more like a 3+ month effort. Ocelot would have to read the debugging information embedded in a PTX file and expose an interface to visual studio for determining the addresses of symbols, setting breakpoints, examining values stored in registers, etc. Not impossible, but more the type of thing that would require paying some good people full time for at least 3-6 months, depending on the person/people involved.

Furthermore I have no idea what writing a plugin or extension to visual studio like that involves in terms of licensing. Even beyond the technical difficulties, which would be similar if I wanted to do that same thing for gdb on linux, I’m not sure if someone could even release an extension like that without paying a licensing fee to microsoft or passing it on to you.

A quick google search led to this:

“Note: Creating a new language extension to the debugger requires a VSPackage license supplied under the terms of the VSIP program.”

This type of thing is why I will never write an extension to a microsoft product even if you pay me.

Perhaps this could be an interesting extension to parallel-nsight as they have probably already jumped through most of these hoops?

That’s exactly what Ailleur said, and I totally agree with him:

Such a tool would be excellent :)


Not that cuda-gdb is all that useful on linux anyways. It still cannot debug codes that use textures!

Nsight’s feature sheet says that it can (though I don’t have a windoze box w/ VS2010 to test it on). Why is cuda-gdb left by the wayside?

I was able to compile Ocelot on Windows through the Cygwin environment last week, and build a library. But, I ran into the problem of incompatible name mangling between nvcc/cl.exe and g++ so I couldn’t do a link. I could probably modify the library to fix this and move past that particular problem, deal with calling convention issues, but I left it there because I have so many other things to do, and because I think there is a better way.

Unless I’m mistaken, I think the real issue is that nvcc compiles only through the MS compiler on Windows, so it’s tied into MSVC. Ideally, it would be nice if NVIDIA could let us specify the compiler tools for nvcc, not just the directory. It probably wouldn’t be that hard for the guys to do this because they already have a GCC target for nvcc on Linux, and I suspect that they have an option to do this but we don’t know what it is. Then, all sorts of IDE’s and debugging environments could be opened up, like Eclipse or Netbeans. Ocelot could then be a plug in for one of those GUI’s. An alternative to nvcc it seems is to write a script or command line driver that does much of what nvcc says it does when compiling with a -v, and just chuck nvcc. This seems really easy to do actually, because all nvcc calls is cudafe, filehash, ptxas, fatbin, nvopencc, cudafe++, and of course, the MSVC compiler cl. The issue would then become how to link with the NVIDIA’s CUDA runtime library. But, if only emulating in Ocelot, this would probably work because you would just link with Ocelot, a proxy for the CUDA library.

So that may be an option for people who are comfortable with cygwin. I think that it would also be relatively easy to do with Visual Studio or Intel’s compiler…

yeah, just have something like an Ocelot-Debug option in your IDE and it would pick libocelot.so instead of libcudart.so.

Yes, I’m ranting, but NVIDIA will do well to listen. I do nothing but CUDA work for a lot of clients and my own stuff…

My laptop no longer has emulation mode NOR does it have printf’s. printf’s seem to belong to fermi only. My fermi box does have printf’s. I no longer have any debugging options unless I use fermi and I have yet to be able to set a breakpoint using remote debugging. Day 3 now.

NVIDIA has made amateur assumptions about how developers work, specifically, they seem to think that anything running on a fermi will be developed on a fermi. Worse, they seem to think that nothing will be developed on less than a fermi from now on. no emulation, no printf. no nothing for g92 nor even single fermi boxes.

Emulation mode solved 90% of my problems, except thread interaction issues, which are completely understandable. If they removed it because it was confusing, well, is your current situation less confusing? They can’t be serious.

NVIDIA’s developer/tools folks are sorely lacking some basic marketing and real-world, production development expertise. Nice collateral and web sites, very poor execution. Seems like college-kids running the asylum to me. This is exactly what I would expect to see from a bunch of inexperienced but smart new college grads with no business experience and I’m pissed. I’ve put a lot of faith in nvidia and my livelihood depends on them being professional and competent.

I have yet to get remote debugging working. I’m a smart guy with many large, successful cuda projects under my belt. If I’m having trouble, well, there’s just no excuse. I just waited 20 minutes for an NSIGHT compilation of a single file that took 1 minute using straight 3.1. Only to then get the message that $(NSIGHT_CUDA_INC_PATH) could not be found.

Anyone see the setup for remote debugging? Are you serious, NVIDIA? This is a commercial product? This smells like an amateur hodgepodge, more like something from AMD and stream computing.

NVIDIA would do well to do fewer things well. They are severely lacking in execution across the board right now and it is affecting me greatly, and I’m the friendliest audience they will ever find!

If I do, finally, get remote debugging working, will I not be able to debug my dll? This is an extremely common design pattern, where a dll is built in order to speed up a legacy system. I have a file that takes 20 minutes, not an exaggeration, to compile. It takes 1 minute under straight 3.1 without nsight, and I still can’t even set a breakpoint in a kernel.

  1. NVIDIA removed emulation mode because it was hard for them to maintain (or more precisely, keep synchronized with new features)

  2. You are right that it was pretty useful

  3. I found that with cuda 3.1 you can get nsight-enabled compilation by just adding -G0 to the compiler options instead of going through all the hoops in the documentation. For some reason I can’t get any trace information no matter which hoops I jump through, but debug works fine with a lot less hassle

  4. It does require two GPUs on the machine. I don’t know about fermi, but I use a single machine with gt240 and test c1060 to debug on. Need to test remote debugging again from the laptop though …

You seem to be ignoring cuPrintf. It just works. Okay, you have to add 3 statements to your overall Cuda program, but after that, no problem.

I think his mistake is that he got too passionate and I can totally relate to that - but I guess we should try to stay calm in those forums.

I don’t think that cuPrintf or fermi’s printf is a good solution - this is indeed not serious. I know I use it to debug sometimes, even plain C++ and not CUDA, but it’s just not serious.

I think that nVidia has indeed taken the short/easy path here. NSIGHT and Ocelot are not suitable for many people for a whole lot of reasons, I think: they don’t use linux/unix, don’t use/want to use 2 computers/gpus, ease of installing, time…

Emulation was perfect with regard to those issues, just add a compile directive and you’re set.

I also think that it would be very helpful, and actually a must, if nVidia supplied something good, easy and WORKING so that developers can debug CUDA just like they debug “regular” code. If nVidia can’t or doesn’t want to do so, I think it would be best if nVidia could get 3rd party companies to do that for them.

Another small example that shows what is also lacking: I have 20 SXXXX machines in production, and there is no serious tool to manage them. I know there is something “Bright something”… but there is nothing that can, for example, give me the overall performance of my GPU cluster - how busy the GPUs are, occupancy, etc…

If nVidia is serious about going into HPC, production-level tools like those must start to appear.

How can someone seriously debug a remote production system?

My additional cent :)