NVIDIA has made a huge mistake with HW debugger: single-GPU debugging not supported and no emulation

Greg_Ross · August 1, 2010, 9:59am

Makes sense to me…force the professional developers to buy two Fermi cards.

/sarcasm

Ken_Domino · August 1, 2010, 11:23am

I was able to compile Ocelot on Windows through the Cygwin environment last week, and build a library. But, I ran into the problem of incompatible name mangling between nvcc/cl.exe and g++ so I couldn’t do a link. I could probably modify the library to fix this and move past that particular problem, deal with calling convention issues, but I left it there because I have so many other things to do, and because I think there is a better way.

Unless I’m mistaken, I think the real issue is that nvcc compiles only through the MS compiler on Windows, so it’s tied into MSVC. Ideally, it would be nice if NVIDIA could let us specify the compiler tools for nvcc, not just the directory. It probably wouldn’t be that hard for the guys to do this because they already have a GCC target for nvcc on Linux, and I suspect that they have an option to do this but we don’t know what it is. Then, all sorts of IDEs and debugging environments could be opened up, like Eclipse or Netbeans. Ocelot could then be a plug-in for one of those GUIs. An alternative to nvcc, it seems, is to write a script or command line driver that does much of what nvcc says it does when compiling with -v, and just chuck nvcc. This seems really easy to do actually, because all nvcc calls is cudafe, filehash, ptxas, fatbin, nvopencc, cudafe++, and of course, the MSVC compiler cl. The issue would then become how to link with NVIDIA’s CUDA runtime library. But, if only emulating in Ocelot, this would probably work because you would just link with Ocelot, a proxy for the CUDA library.

I found out that cudafe on Windows assumes the source is MSVC when parsing, and there is no option that we can set to change this (“–a”, “–b”, “–c” … gets a list of specific options for the ambiguous cmd line args), even though it seems the Edison Design Group parses anything. Cudafe doesn’t parse g++ preprocessor output.

Gregory_Diamos · August 4, 2010, 11:07am

So I talked with Vinod Grover from NVIDIA over the weekend, and came to the conclusion that building Ocelot on Windows would probably be a good idea, if for no other reason than to improve the stability of the code and test it on multiple platforms. I started working on it on Sunday night, and finished tonight after work. Here is a link to the first static library that I was able to build (the most annoying part was a lack of rvalue support in Visual Studio). Right now only the emulator sources are included, as they were easier than building LLVM and linking against the CUDA/CAL drivers. I’ll merge the modified sources into the Ocelot trunk and try testing it a bit tomorrow.

If anyone wants to try it out, I am expecting a ton of bugs, but it is closer to being functional than it was last week.

http://www.gdiamos.net/files/gpuocelot.lib

eyalhir74 · August 4, 2010, 11:12am

Amazing :)

Can you please elaborate a bit how to use it? Just link it to the application and ???

thanks

eyal

Gregory_Diamos · August 4, 2010, 11:48am

Ideally you should just be able to link your application against it rather than against cudart.dll. Again, I’m expecting some problems, and will try running some unit tests after work today…

Ken_Domino · August 4, 2010, 4:12pm

Doesn’t link because of naming problems. E.g., cudaMalloc@8 (expecting) vs. cudaMalloc (defined in your static library).

Are you using the right linkage?

ocelot/cuda/interface/cuda_runtime.h defines cudaMalloc as:

extern cudaError_t cudaMalloc(void **devPtr, size_t size);

But, in NVIDIA’s CUDA library, the function is declared (in cuda_runtime_api.h) as:

extern __host__ cudaError_t CUDARTAPI cudaMalloc(void **devPtr, size_t size);

where CUDARTAPI is defined with “#define CUDARTAPI __stdcall”.
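To make the mismatch described above concrete, here is a minimal sketch, assuming 32-bit x86 MSVC and C linkage, where a __stdcall function is decorated as an underscore, the name, then "@" and the argument bytes, while a plain cdecl function gets only the underscore. SKETCH_API, sketchMalloc and the two decorate helpers are hypothetical names made up for illustration; they are not part of Ocelot or the CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical macro in the spirit of CUDARTAPI: __stdcall on 32-bit Windows,
// empty elsewhere (the calling convention keyword is ignored or absent there).
#if defined(_WIN32) && !defined(_WIN64)
#define SKETCH_API __stdcall
#else
#define SKETCH_API
#endif

// Stand-in for a runtime entry point (NOT the real cudaMalloc). Declaring it
// with the same macro as the header the caller compiled against keeps the
// symbol the library exports and the symbol the caller imports identical.
extern "C" int SKETCH_API sketchMalloc(void** devPtr, std::size_t size) {
    if (devPtr == nullptr || size == 0) return 1;  // reject invalid arguments
    *devPtr = nullptr;                             // placeholder, allocates nothing
    return 0;
}

// Decoration applied by 32-bit MSVC to a C-linkage __stdcall function:
// leading underscore, name, '@', total size of the arguments in bytes.
std::string decorate_stdcall(const std::string& name, std::size_t argBytes) {
    return "_" + name + "@" + std::to_string(argBytes);
}

// Decoration applied by 32-bit MSVC to a C-linkage __cdecl function.
std::string decorate_cdecl(const std::string& name) {
    return "_" + name;
}
```

With two 4-byte arguments (a pointer and a 32-bit size_t), decorate_stdcall("cudaMalloc", 8) gives the kind of "@8" name the linker was asking for, while decorate_cdecl("cudaMalloc") gives the kind of name a library built without the __stdcall qualifier would export.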

Gregory_Diamos · August 4, 2010, 5:40pm

Thanks for the update. This evidently didn’t matter on linux. I’ll go back and make this change tonight.

dcbarton · August 4, 2010, 11:51pm

Not ignoring it, but remember that cuPrintf is not included in 3.1; you have to sign up as a developer and look very hard for it, knowing what you’re looking for.

It isn’t that I can’t debug at all, it’s just that it is now painful and not even part of their cuda sdk unless you have fermi, and 2 fermis to do any kind of professional debugging like… gasp… breakpoints… when the last rev had a decent solution for most bugs.

My biggest problem is to see such amateurish execution. It’s very difficult to see such steps backward when nvidia is trying to push cuda development forward.

Finally, there are many cases where printf’s crash my system in complex kernels, even simple ones just spitting out a few ints.

dcbarton · August 4, 2010, 11:54pm

I think his mistake is that he got too passionate and I can totally relate to that - but I guess we should try to stay calm in those forums.

I don’t think that cuPrintf or Fermi’s printf is a good solution - this is indeed not serious. I know I use it to debug sometimes, even plain C++ and not CUDA, it’s just not serious.

I think that nVidia has indeed taken the short/easy path here. NSIGHT and Ocelot are not suitable for many people for a whole lot of reasons, I think: don’t use linux/unix, don’t use/want to use 2 computers/gpus, ease of installing, time…

Emulation was perfect with regard to those issues, just add a compile directive and you’re set.

I also think that it would be very helpful and actually a must if nVidia supplies something good, easy and WORKING so that developers can debug CUDA just like they debug “regular” code. If nVidia can’t or doesn’t want to do so, I think it would be best if nVidia can get 3rd party companies to do that for them.

Another small example that shows what is also lacking: I have 20 SXXXX machines in production, and there is no serious tool to manage that. I know there is something “Bright something”… but there is nothing that for example can give me the overall performance of my GPU cluster - how busy the GPUs are, occupancy, etc…

If nVidia is serious about going into HPC, those tools, at production level, must start to come up.

How can someone seriously debug a remote production system?

My additional cent :)

eyal

dcbarton · August 5, 2010, 12:05am

I think it’s important for nvidia to hear the passion. They need to hear the real frustration, time and money they are costing serious developers, at a crucial time when they truly need developers, so I’m being very vocal on these boards hoping they are reading. I work on commercial, production and consumer systems and the target machine is never fermi (hpc product) but either gtx480 or lesser (no fast doubles). I get the feeling nvidia thinks all development is in an academic lab where phd candidates are willing to go to any platform and toolchain to make things work. This simply ignores the realities of production work and limited schedules.

I’m their biggest fan and have pushed cuda into many real world solutions for some big companies but if they don’t start doing a better job, I won’t be able to advocate for them with a straight face. If they can’t keep their biggest fan and cheerleader, they’re in a world of trouble.

Gregory_Diamos · August 6, 2010, 4:27pm

So I was able to get a few examples linked against Ocelot and pass the built-in regression tests. I also tried a few examples with memory errors to see if Ocelot could detect them correctly. They are being detected, but the mechanism used to report errors, exceptions, is handled differently on Windows (the default exception handler never calls what() on std::exceptions), so you don’t get any intelligent error messages. I’m going to leave the behavior as is in Ocelot in order to allow people to catch and handle errors or get a useful error message. Someone at MS should fix that btw, the default behavior in GCC is far more useful.
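For anyone hitting the silent-exception behavior described above, one workaround on the application side is to catch std::exception yourself and print what(). The following is only a sketch: run_guarded and faulty_launch are hypothetical names, and the thrown message is made up rather than taken from Ocelot’s output.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Runs `body` and reports any std::exception through what(), so the message
// is printed even when the platform's default handler would not show it.
// Returns 0 on success, 1 for a std::exception, 2 for anything else.
int run_guarded(const std::function<void()>& body) {
    try {
        body();
        return 0;
    } catch (const std::exception& e) {
        std::cerr << "error: " << e.what() << '\n';
        return 1;
    } catch (...) {
        std::cerr << "error: unknown exception\n";
        return 2;
    }
}

// Hypothetical stand-in for a CUDA call that the emulator rejects.
void faulty_launch() {
    throw std::runtime_error("out-of-bounds access (illustrative message)");
}
```

In a real program the body passed to run_guarded would be the code that makes CUDA runtime calls, and main would return the value from run_guarded as its exit status.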

The sources will be merged with the google code trunk over the weekend. In the meantime, you can download my debug build from http://www.gdiamos.net/files/gpuocelot.lib

Gregory_Diamos · August 6, 2010, 10:41pm

One last update before I move on to other things:

This is just an update to let you know that I created a new branch ocelot-windows that includes visual studio files and code changes to allow ocelot to build on windows. I also back merged the changes that do not impact the quality of the code into the trunk.

Notable changes to the trunk:

We now use boost::thread rather than pthreads. This adds a dependency on libboost_thread. I would still recommend using hydrazine::Thread rather than boost directly though, because writing programs with locks is horrible and hydrazine::Thread abstracts everything as messages.

Some system-specific functionality, like getting the number of hardware threads, has been moved into wrappers in hydrazine.
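As a rough idea of what a wrapper for the hardware thread count can look like, here is a sketch that uses std::thread::hardware_concurrency from the C++ standard library as a stand-in. That is an assumption made for illustration: the sketch namespace and the hardwareThreads name are hypothetical and this is not hydrazine’s actual interface.

```cpp
#include <cassert>
#include <thread>

namespace sketch {

// Hypothetical wrapper: callers ask for the hardware thread count here and
// never touch a platform-specific API directly.
inline unsigned hardwareThreads() {
    // hardware_concurrency() may return 0 when the value cannot be determined,
    // so fall back to a single thread in that case.
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

}  // namespace sketch
```

Keeping the call behind one function means a platform that needs a different query only changes this wrapper, not every caller.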

If you want to play around with the Windows branch, check out the branch and have a look at the msvs/gpuocelot folder for a series of VS projects. Also, take care when modifying the flex/bison sources for the parser. VS has serious problems compiling their outputs, so much so that I had to write wrapper programs to pass over the generated .cpp files for the lexer to make them acceptable to VS. These don’t really integrate seamlessly with VS, so any changes to the parser/lexer may require modifying the custom build rules for flex/bison or the wrapper programs.

Let me know if anyone has any suggestions or comments. I am currently looking for someone to pick this up and maintain it, and eventually merge it into the trunk if the windows specific functionality can be abstracted behind a system library. Preferably someone who uses windows for CUDA development.

jack · August 7, 2010, 6:40pm

If you really wanted to, you could link the application against cudart.dll and still use Ocelot (on demand). I don’t know how to do it on Linux, but I could pretty easily write the Windows code if you wanted it (in fact, I was actually planning to write something similar, and a majority of the code will be reusable).
