A different way to think about writing GPU applications

I’ve been playing around with a refined model of how to program GPUs and I’m interested in
feedback from others who spend a lot of time doing GPU programming. Would this type of
programming model appeal to you?

Check out the source code that I’m describing: http://code.google.com/p/vanaheimr/source/browse

When most people think of GPU Computing, they think of heterogeneous applications that run on a combination of a CPU and a GPU. This allows programmers to use single-threaded algorithms on the CPU and parallel algorithms on the GPU, but they must accept the extra complexity of compiling for different types of processors with different memory spaces.

In general I think that complexity is justified for performance-critical applications, but most applications are not performance critical. Much of the time it is acceptable to write an application with good asymptotic complexity whose performance scales well on future (faster) hardware, and then, if performance becomes an issue, tune specific pieces of it. For these applications the explicit control over the CPU and GPU portions of the code becomes a handicap, because dealing with it is a requirement, not an option. For many applications, I think it would be more desirable to just think about parallelism, synchronization, and data sharing, which are hard enough problems without heterogeneity.

So what would be ideal? For a C++ enthusiast like me, the best case scenario would be to allow me to write applications in vanilla C++11 with two simple extensions.

  • Parallelism (an API call to launch a SPMD kernel over a function pointer).
  • Synchronization (an API call for a barrier over SPMD threads in a group).

With these extensions, a C++ program would begin with a single thread entering main, memory would be managed with malloc/new and drawn from a single shared address space, and the use of parallelism (or not) would be up to the programmer. Heterogeneity would be abstracted away and the programmer would be given the perspective that he is running on a single processor with a large number of cores. If the two previously mentioned extensions were implemented in a library, there would be no need for a domain specific compiler - an off-the-shelf C++ compiler would work. More importantly, it would be possible to compile existing libraries as is, and call them from the GPU device.
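Roughly, what I have in mind would look something like this. The names launch_kernel and barrier are made up for illustration, and the trivial std::thread implementation is just there so the sketch builds with a stock C++11 compiler:

```cpp
// A minimal sketch of the two proposed extensions, implemented on the CPU
// with plain C++11. launch_kernel() and barrier() are hypothetical names,
// not an existing API.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

namespace {
std::mutex              barrierMutex;
std::condition_variable barrierCondition;
int                     barrierWaiting    = 0;
int                     barrierGroupSize  = 0;
int                     barrierGeneration = 0;
}

// Barrier over all SPMD threads in the current group.
void barrier()
{
    std::unique_lock<std::mutex> lock(barrierMutex);
    int generation = barrierGeneration;
    if (++barrierWaiting == barrierGroupSize) {
        barrierWaiting = 0;
        ++barrierGeneration;
        barrierCondition.notify_all();
    } else {
        barrierCondition.wait(lock,
            [&] { return generation != barrierGeneration; });
    }
}

// Launch an SPMD kernel over a function pointer: every thread runs
// 'kernel' with its own id and the total thread count.
void launch_kernel(void (*kernel)(int, int), int threadCount)
{
    barrierGroupSize = threadCount;
    std::vector<std::thread> threads;
    for (int id = 0; id < threadCount; ++id)
        threads.emplace_back(kernel, id, threadCount);
    for (auto& thread : threads)
        thread.join();
}

void helloKernel(int threadId, int threadCount)
{
    std::printf("hello from thread %d of %d\n", threadId, threadCount);
    barrier(); // wait for the whole group before continuing
}

int main()
{
    // One thread enters main; parallelism is opt-in through the API.
    launch_kernel(helloKernel, 4);
    return 0;
}
```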

So why isn’t it possible to do this today?

Well it is possible to do this today and run applications on a CPU. However, that isn’t interesting because it doesn’t allow me to accomplish my primary goal of implementing forward scalable parallel algorithms that run well on systems with hundreds or thousands of cores.

So instead, why isn’t it possible to do this today on a GPU?

Well, it turns out that it is possible to do most of this today if you are willing to be creative in your use of existing tools, but you need to address the following real or perceived issues.

  • Compilers that target GPUs don't implement all of C++11.
  • GPUs don't provide basic functionality that programs expect, which is normally accessed through the standard library (File IO, system calls, etc).
  • Performance of single threaded code running on the GPU will be so bad that it drags down the rest of the program.
  • GPUs need a CPU for some functionality.

So let me address these one at a time.

GPU Compilers Can’t Compile C++11

Well most CPU compilers can’t either (can any?). So the more important question is “can they compile enough of C++11 to be useful?”. The most mature GPU compilers that handle C-like languages are NVCC, command-line wrappers around OpenCL JIT compilers, and a bunch of other options that are far less stable (I’m hoping that CLANG->LLVM->PTX will get some better support, but it can’t even compile my “Hello World” code yet). So I’m still sticking with NVCC.

NVCC ends up working out very well. It supports many useful C++ features that had never been available on GPUs before NVCC implemented them.

  • Basic C/C++ statements and control flow.
  • Function calls.
  • Separable compilation (plus a linker).
  • Correct handling of classes/polymorphism/new/delete.
  • Template metaprogramming.
  • Kernel launches from the GPU (added in Kepler).
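To illustrate the last item, a device-side launch looks like a normal launch written inside a kernel. A minimal sketch (it needs a compute capability 3.5 device and separable compilation, roughly nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```cuda
#include <cstdio>

__global__ void childKernel(int parentThread)
{
    printf("child %d launched from parent thread %d\n",
           threadIdx.x, parentThread);
}

__global__ void parentKernel()
{
    // A kernel launched from device code (dynamic parallelism, Kepler+).
    childKernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```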

It does have some notable missing features that need to be worked around.

  • All functions need to be tagged with __device__ (which is usable but highly cumbersome if you don't want to address heterogeneity).
  • No exceptions.
  • No global constructors/destructors in device code.
  • No variadic functions/templates other than printf.
  • No rvalue references.
  • Several other minor features (constexpr, lambdas, static_assert, etc).

All in all, other than the need to tag everything with __device__, NVCC has better C++ support than many CPU compilers (out of the compilers I use regularly, only GCC, CLANG, and MSVS are clearly better).
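To make the tagging issue concrete: any shared code that a kernel calls has to be annotated before NVCC will compile it for the device. A common workaround is a macro like the DEVICE one below, which is just a local convention, not part of CUDA:

```cuda
#include <cuda_runtime.h>

// Illustrative only: ordinary shared code has to be tagged before device
// code can call it.
#ifdef __CUDACC__
#define DEVICE __host__ __device__
#else
#define DEVICE
#endif

DEVICE int clamp(int value, int low, int high)
{
    return value < low ? low : (value > high ? high : value);
}

__global__ void clampKernel(int* data, int size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size) data[index] = clamp(data[index], 0, 255);
}

int main()
{
    int* data = 0;
    cudaMalloc(&data, 1024 * sizeof(int));
    clampKernel<<<4, 256>>>(data, 1024);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```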

GPUs Can’t Implement The Standard Library

Even with decent compiler support in NVCC, most C++ applications lean heavily on the standard library to avoid re-inventing the wheel and to access system level functionality (most significantly File IO) in a platform independent way.

There isn’t a great workaround for this one. I had to reimplement a lot of functionality in the Standard Library to get around this. Hopefully, it only has to be done once and will eventually ship with a GPU compiler.

File IO was particularly bad and involved doing remote accesses through the CPU. However, I was able to get this working, and it should be possible to push all of the complexity into a low level library that the standard library (FILE and iostream) calls into.
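To give a feel for the mechanism, here is a heavily simplified sketch of the general technique (not the actual archaeopteryx code): a GPU thread posts a request into mapped, pinned host memory, the CPU polls for it, performs the read with the normal C library, and signals completion.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

struct FileRequest
{
    volatile int pending;     // set by the GPU, cleared by the CPU
    volatile int done;        // set by the CPU when the read finished
    volatile int bytesRead;   // filled in by the CPU
    char         filename[256];
    char         data[4096];
};

__global__ void readFileKernel(FileRequest* request)
{
    // Fill in the request (the filename here is just an example).
    const char* name = "input.txt";
    int i = 0;
    for (; name[i] != 0; ++i) request->filename[i] = name[i];
    request->filename[i] = 0;

    __threadfence_system();   // make the request visible to the host
    request->pending = 1;

    // Spin until the CPU reports that the read finished.
    while (request->done == 0) { }

    printf("GPU received %d bytes from the CPU\n", request->bytesRead);
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    FileRequest* request = 0;
    cudaHostAlloc(reinterpret_cast<void**>(&request), sizeof(FileRequest),
                  cudaHostAllocMapped);
    std::memset(request, 0, sizeof(FileRequest));

    FileRequest* deviceRequest = 0;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&deviceRequest),
                             request, 0);

    readFileKernel<<<1, 1>>>(deviceRequest);

    // Service loop: wait for the request, do the IO, signal completion.
    while (request->pending == 0) { }
    std::FILE* file = std::fopen(request->filename, "rb");
    int bytes = 0;
    if (file)
    {
        bytes = (int)std::fread(request->data, 1,
                                sizeof(request->data), file);
        std::fclose(file);
    }
    request->bytesRead = bytes;
    request->pending   = 0;
    request->done      = 1;

    cudaDeviceSynchronize();
    cudaFreeHost(request);
    return 0;
}
```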

Performance of Single Threaded Code Will Be Too Slow

This is a valid concern, but it is something that I hope future systems will resolve automatically over time (e.g. by running single threads on fast cores and parallel kernels on GPU cores). Today, I can live with the latency needed for a single thread to enter main and launch a parallel kernel.

GPUs Need a CPU For Some Functionality

Surprisingly, this wasn’t an issue at all. It was fairly easy to write a CPU program that loads a GPU binary, looks up main, and kicks off a single thread. I did this using Ocelot, but it would be possible to do the same thing using CUDA driver calls. It was so straightforward, in fact, that I was able to add a build rule to SCons for GPU applications that builds the device code, embeds it in a vanilla loader program, and builds the loader with a native compiler so it runs on the CPU. When the loader starts, it extracts the GPU binary and kicks off the main function, giving the illusion that the application never interacts with the CPU at all.
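For reference, the driver-API flavor of that loader would look roughly like this. The entry point name deviceMain and the cubin filename are placeholders, and the real build embeds the GPU binary in the loader rather than reading it from disk:

```cuda
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice  device;
    CUcontext context;
    cuDeviceGet(&device, 0);
    cuCtxCreate(&context, 0, device);

    // Load the compiled GPU binary and look up the entry point.
    CUmodule module;
    cuModuleLoad(&module, "application.cubin");

    CUfunction entryPoint;
    cuModuleGetFunction(&entryPoint, module, "deviceMain");

    // Kick off a single GPU thread in the entry point.
    cuLaunchKernel(entryPoint,
                   1, 1, 1,   // grid dimensions
                   1, 1, 1,   // block dimensions: one thread
                   0, 0,      // shared memory, stream
                   0, 0);     // kernel parameters, extra options
    cuCtxSynchronize();

    cuModuleUnload(module);
    cuCtxDestroy(context);
    return 0;
}
```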

With an implementation of the Standard Library, there was no need to call back to the CPU to do anything.

Summing It Up

Although there are a few minor issues, it is largely possible to run native C++ applications completely on a GPU without any CPU interaction other than to kick off the first kernel.

Can you point to an example in your repository?

See the entry point in the repository on Google Code: “archaeopteryxDriver” could be renamed “main”.

See the host-side glue code in the repository on Google Code; it starts up a runtime for handling GPU callbacks, copies command line arguments to the GPU, and starts a thread in “archaeopteryxDriver”.

I may be off here, but is the Microsoft C++ AMP library sort of what you’re looking for?

C++ AMP makes some steps in this direction, but it really is a new language, not C++. It takes the stance that a program always has a CPU, and that memory and ISA heterogeneity should be first class language constructs (restrict and array_view). So there would still be similar issues that I would need to work around.
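For comparison, here is a tiny C++ AMP fragment (MSVC-only, purely illustrative): the array_view wrapper and the restrict(amp) qualifier keep the heterogeneity visible right in the source.

```cpp
#include <amp.h>
#include <vector>

void scale(std::vector<float>& data, float factor)
{
    // array_view makes the CPU/GPU memory split a first-class construct.
    concurrency::array_view<float, 1> view((int)data.size(), data);

    // restrict(amp) marks code that must be compilable for the accelerator.
    concurrency::parallel_for_each(view.extent,
        [=](concurrency::index<1> i) restrict(amp)
        {
            view[i] *= factor;
        });

    view.synchronize(); // copy results back to the host vector
}
```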

It doesn’t really address the problem of needing to access system functionality from the GPU. Also, as a practical limitation, it uses DirectCompute as a backend, which makes it harder to do things like nested parallelism.

I think that it is interesting to explore language extensions to address heterogeneity, but I would like to have the option of not using them. I’m hoping that Microsoft will try to fold some of the ideas in C++ AMP back into C++ and flesh out their GPU backend to make it as full featured as their CPU backend.

Because NVIDIA in fact shares this way of thinking, I gave up programming with CUDA for a while, like many others. I plan to return to CUDA later. CUDA is the future. Maybe in 3 years, maybe in 5, maybe in 55. Instead of implementing features that developers need, they implemented C++ and other nearly useless stuff which is always full of errors.

Lev, could you be more specific about which features you think developers want but have not been implemented?

Greg, I was doing some searching and your name popped up when I came across Ocelot. I have some questions about runtime code generation (posted on another thread here) and was wondering if I could pick your brain. Where would be the best place to contact you? Thanks.

https://devtalk.nvidia.com/default/topic/526247/cuda-programming-and-performance/cuda-needs-ipad-ipod-iphone-approach-to-mass-success-software-hardware-integration-and-consumer-ma/

This style of GPU programming reminds me a little bit of the Cell Messaging Layer:

http://www.ccs3.lanl.gov/~pakin/software/cellmessaging/

This library changed the programming model of a large cluster of Cell processors (like Roadrunner) to be one where the PPEs were not explicitly programmed. Instead everything happened on the SPEs, and SPEs were free to communicate with SPEs on other devices or nodes. CML provided some basic service code that ran on all the PPEs (and host x86 processors) to shuffle the needed messages around.

If the Maxwell series of GPUs get an on-chip supervisory ARM core, then I fully expect the programming model you describe to be even easier to implement. (And the CUDA architecture will resemble Cell a little more…)

@seibert, thanks for the CML pointer, I haven’t been watching Cell development for the last year or
so and I hadn’t seen it yet.

It sounds like it uses the PPE as an IO-Hub or router to avoid implementing the network stack on the SPEs, which does seem very similar to using the CPU core for File-IO and interacting with device drivers. There are some CUDA projects that aim to do similar things, e.g. cudaMPI for message passing or Mark Silberstein’s work on GPUfs for File-IO, but there is still a lot of work to be done.

Hopefully the desire for some of the programming model abstractions that I mentioned will motivate further work on these topics.

@Lev, it sounds like you mainly want a simple and rock-solid environment for writing GPU applications.

This is not a trivial problem by any means, but the computing industry has a long history
of designing software development tools (compilers, IDEs, debuggers, etc) that are
reliable and usable. I think that it will be possible for GPU programming tools that leverage
and expand on these ideas to emerge and continue to mature.

Yes, I think this is what desktop, workstation, and mainstream scientific developers need. Times are different now. CPU performance is very high, much higher than 15 years ago; if CUDA had come out back then with a 10x speedup over the CPU, it would have been a different story. Now programming simplicity is important. There is no time to learn the cudaMemcpy3D parameters or to figure out why Nsight does not work.

@Lev have you tried using the Nsight IDE (for Linux/Mac)? It was released with CUDA 5, worked fine for me without any tweaking, and has all the trappings of modern IDEs (syntax highlighting, autocomplete, etc). It’s basically Eclipse CDT with built-in CUDA support.

It’s great that it finally comes with CUDA 5.0, but I do not use Linux/Mac.

@alrikai do you use CUDA for personal use or for Mac/Linux applications?

Well, both I guess. For work I mostly use CUDA on Linux; for personal use I’ve historically used CUDA on Windows. In both work and personal settings I use CUDA for writing (parts of) applications.

Who are your customers? Contractors or consumers? What is the market?