I’ve been playing around with a refined model of how to program GPUs and I’m interested in
feedback from others who spend a lot of time doing GPU programming. Would this type of
programming model appeal to you?
Check out the source code (http://code.google.com/p/vanaheimr/source/browse) for the project I’m describing.
When most people think of GPU computing, they think of heterogeneous applications that run on a combination of a CPU and a GPU. This allows programmers to use single-threaded algorithms on the CPU and parallel algorithms on the GPU, but they must accept the extra complexity of compiling for different types of processors with different memory spaces.

In general I think that complexity is justified for performance-critical applications, but most applications are not performance critical. Much of the time it is acceptable to write an application with good asymptotic algorithmic complexity whose performance scales well on future (faster) hardware. Then, if performance becomes an issue, specific pieces of the application can be tuned.

For these applications, explicit control over the CPU and GPU portions of the code becomes a handicap, because dealing with it is a requirement, not an option. For many applications, I think it would be more desirable to just think about parallelism, synchronization, and data sharing, which are hard enough problems without heterogeneity.
So what would be ideal? For a C++ enthusiast like me, the best-case scenario would be writing applications in vanilla C++11 with two simple extensions:
- Parallelism (an API call to launch a SPMD kernel over a function pointer).
- Synchronization (an API call for a barrier over SPMD threads in a group).
With these extensions, a C++ program would begin with a single thread entering main, memory would be managed with malloc/new and drawn from a single shared address space, and the use of parallelism (or not) would be up to the programmer. Heterogeneity would be abstracted away, and the programmer would be given the perspective of running on a single processor with a large number of cores. If the two extensions were implemented in a library, there would be no need for a domain-specific compiler - an off-the-shelf C++ compiler would work. More importantly, it would be possible to compile existing libraries as-is and call them from GPU code.
So why isn’t it possible to do this today?
Well, it is possible to do this today and run such applications on a CPU. However, that isn’t interesting because it doesn’t accomplish my primary goal: implementing forward-scalable parallel algorithms that run well on systems with hundreds or thousands of cores.
So instead, why isn’t it possible to do this today on a GPU?
Well, it turns out that it is possible to do most of this today if you are willing to be creative in your use of existing tools, but you need to address the following real or perceived issues:
- Compilers that target GPUs don't implement all of C++11.
- GPUs don't allow basic functionality that programs expect, normally provided through the standard library (file I/O, system calls, etc.).
- Performance of single-threaded code running on the GPU will be so bad that it drags down the rest of the program.
- GPUs need a CPU for some functionality.
So let me address these one at a time.