There have been several posts on this forum (http://forums.nvidia.com/index.php?showtopic=152580 for example) arguing for a single compilation chain from CUDA to GPUs and CPUs. My research group has been working on a backend dynamic compiler that translates CUDA (PTX) applications to several different targets. We are now ready to release alpha versions of two targets: NVIDIA GPUs and multi-core x86 CPUs.
We previously released Ocelot, which at that time consisted mainly of an emulator that could run CUDA programs on CPUs with a high performance overhead. Since that time, we have added two additional targets to Ocelot, x86 CPUs and NVIDIA GPUs, both of which execute native instructions rather than relying on emulation.
This post is to announce an alpha release that is available for download ( http://code.google.com/p/gpuocelot/source/checkout ). We are still working to clean up some of the internals of each target, but we would like to make these tools available to anyone else who might find them useful. Currently we have verified that 132/132 CUDA applications in our test suite correctly execute on the CPU target, and 115/132 applications correctly execute on the GPU target.
Here is a preliminary list of features:
- All targets are exposed as CUDA devices. To switch between execution on a GPU or a CPU, simply select a different device.
- A CPU Target:
[*] Multi-core execution. CUDA kernels will be automatically distributed across all CPU cores in a system.
[*] Dynamic optimization. Kernels will be optimized as they are executing.
[*] Support for all CUDA features. This includes textures, OpenGL interop, events, streams, CUDA arrays (cudaMallocArray), all memory spaces, etc.
[*] High performance. Though it may be necessary to hand-optimize the source code, this target can achieve close to the theoretical peak performance of many multi-core CPUs. Our internal benchmarks have hit 80% of peak on an Intel Core i7 920.
- A GPU Target:
[*] This is a wrapper around NVIDIA’s JIT compiler that supports dynamic optimization.
[*] Dynamic optimization. Kernels will be optimized as they are executing.
[*] Supports floating contexts. A single host thread can control multiple GPU devices and pointers can be passed from one host thread to another.
- An Emulator Target:
[*] Supports memory bounds checking
[*] Ability to collect detailed performance information as a program is running
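To give a rough idea of what "multi-core execution" means for the CPU target, here is a minimal C++ sketch of spreading the thread blocks of a kernel across CPU cores. This is not Ocelot's actual implementation: `launchOnCpu`, `KernelBody`, and `vectorAdd` are hypothetical names made up for this illustration, and the sketch runs the threads of a block one after another, so it would not handle kernels that use barriers such as `__syncthreads()`.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a kernel body: invoked once per (block, thread) pair.
using KernelBody = std::function<void(unsigned blockIdx, unsigned threadIdx)>;

// Run gridDim blocks of blockDim threads, spreading the blocks over CPU cores.
// Each worker claims the next unprocessed block from a shared atomic counter
// and then runs that block's threads serially (no barrier support).
void launchOnCpu(unsigned gridDim, unsigned blockDim, const KernelBody& body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    workers = std::min(workers, std::max(1u, gridDim));

    std::atomic<unsigned> nextBlock{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                const unsigned block = nextBlock.fetch_add(1);
                if (block >= gridDim) break;  // no blocks left to claim
                for (unsigned t = 0; t < blockDim; ++t) body(block, t);
            }
        });
    }
    for (auto& worker : pool) worker.join();
}

// Example "kernel launch": element-wise vector addition.
std::vector<float> vectorAdd(const std::vector<float>& a,
                             const std::vector<float>& b,
                             unsigned blockDim = 64) {
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<float> c(n);
    const unsigned gridDim = static_cast<unsigned>((n + blockDim - 1) / blockDim);

    launchOnCpu(gridDim, blockDim, [&](unsigned blockIdx, unsigned threadIdx) {
        // Same global index computation a CUDA kernel would use.
        const std::size_t i = static_cast<std::size_t>(blockIdx) * blockDim + threadIdx;
        if (i < n) c[i] = a[i] + b[i];
    });
    return c;
}
```

Calling `vectorAdd(a, b)` plays the role of a kernel launch with `ceil(n / 64)` blocks of 64 threads. Handing out whole blocks from a shared counter is just one simple way to balance work across cores; it was picked here for brevity.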
Limitations:
- At this time we only support Linux and require a system with gcc-4.2 or later.
- Support for multi-threaded host applications is buggy when using the GPU target.
- No support for SSE units on the CPU target as of yet. These should be supported in the next release.
At this time the only available version must be checked out from Subversion and compiled from source. As soon as both targets pass our internal regression tests we will do a packaged release as well.
All of this code is released open source under the BSD license, which makes it free for commercial and academic use.
It would really help us if people could try running their applications with Ocelot and report any bugs here: http://code.google.com/p/gpuocelot/issues/list