Ocelot 1.0 Alpha Release: High Performance GPU and Multi-core CPU targets

Gregory_Diamos · December 16, 2009, 7:38pm

There have been several posts on this forum ( http://forums.nvidia.com/index.php?showtopic=152580 for example ) arguing for a single compilation chain from CUDA to GPUs and CPUs. My research group has been working on a backend dynamic compiler from CUDA (PTX) applications to several different targets. We are now ready to release alpha versions of two targets, NVIDIA GPUs and Multi-Core x86 CPUs.

We previously released Ocelot, which at that time consisted mainly of an emulator that could run CUDA programs on CPUs with a high performance overhead. Since that time, we have added two additional targets to Ocelot, x86 CPUs and NVIDIA GPUs, both of which execute native instructions rather than relying on emulation.

This post is to announce an alpha release that is available for download ( http://code.google.com/p/gpuocelot/source/checkout ). We are still working to clean up some of the internals of each target, but we would like to make these tools available to anyone else who might find them useful. Currently we have verified that 132/132 CUDA applications in our test suite correctly execute on the CPU target, and 115/132 applications correctly execute on the GPU target.

Here is a preliminary list of features:

All targets are exposed as CUDA devices. To switch between execution on a GPU or a CPU, simply select a different device.
A CPU Target:

[*] Multi-core execution. CUDA kernels will be automatically distributed across all CPU cores in a system.

[*] Dynamic optimization. Kernels will be optimized as they are executing.

[*] Support for all CUDA features. This includes textures, OpenGL, events, streams, malloc array, all memory spaces, etc.

[*] High Performance. Though it may be necessary to hand-optimize the source code, this target can achieve close to the theoretical peak performance of many Multi-Core CPUs. Our internal benchmarks have hit 80% of peak on an Intel Core i7 920.
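
To make the multi-core execution item concrete, here is a minimal C++ sketch of one way a grid of CTAs can be spread over host threads. It is an illustration under stated assumptions, not Ocelot's code: the names `KernelBody`, `launchOnCpu`, and `vectorAdd` are hypothetical, the kernel is an ordinary C++ callable rather than translated PTX, and CTAs are assigned round-robin purely for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a compiled kernel body: invoked once per
// (CTA index, thread index within the CTA).
using KernelBody = std::function<void(std::size_t cta, std::size_t tid)>;

// Run numCtas CTAs of ctaSize logical threads each, spreading the CTAs
// round-robin over `workers` host threads. The logical threads of one CTA
// run serially on the worker that owns that CTA.
void launchOnCpu(std::size_t numCtas, std::size_t ctaSize,
                 std::size_t workers, const KernelBody& body) {
    // Use at least one worker and never more workers than CTAs.
    workers = std::max<std::size_t>(1, std::min(workers, numCtas));
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([=, &body] {
            for (std::size_t cta = w; cta < numCtas; cta += workers)
                for (std::size_t tid = 0; tid < ctaSize; ++tid)
                    body(cta, tid);
        });
    }
    for (auto& t : pool) t.join();
}

// Example "kernel": c[i] = a[i] + b[i], one element per logical thread.
std::vector<float> vectorAdd(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::size_t ctaSize) {
    assert(a.size() == b.size() && ctaSize > 0);
    std::vector<float> c(a.size());
    const std::size_t numCtas = (a.size() + ctaSize - 1) / ctaSize;
    // hardware_concurrency() may return 0; launchOnCpu clamps that to 1.
    const std::size_t workers = std::thread::hardware_concurrency();
    launchOnCpu(numCtas, ctaSize, workers,
                [&](std::size_t cta, std::size_t tid) {
                    const std::size_t i = cta * ctaSize + tid;
                    if (i < c.size()) c[i] = a[i] + b[i];  // guard the tail CTA
                });
    return c;
}
```

Calling `vectorAdd(a, b, 256)` mirrors a CUDA launch with 256 threads per block. Because CTAs are independent in the CUDA model, each worker can process its CTAs without synchronizing with the others.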

A GPU Target:

[*] This is a wrapper around NVIDIA’s JIT compiler that supports dynamic optimization.

[*] Dynamic optimization. Kernels will be optimized as they are executing.

[*] Supports floating contexts. A single host thread can control multiple GPU devices and pointers can be passed from one host thread to another.

An Emulator Target:

[*] Supports memory bounds checking

[*] Ability to collect detailed performance information as a program is running

Limitations:

At this time we only support Linux and require a system with gcc-4.2 or later.
Support for multi-threaded host applications is buggy when using the GPU target.
No support for SSE units on the CPU target as of yet. These should be supported in the next release.

At this time the only way to get Ocelot is to check out the source from Subversion and compile it. As soon as both targets pass our internal regression tests we will do a packaged release as well.

All of this code is released open source under the BSD license, which makes it free for commercial and academic use.

It would really help us out a lot if people could try out running their applications using Ocelot, and report any bugs here: http://code.google.com/p/gpuocelot/issues/list

cbuchner1 · December 16, 2009, 7:42pm

Any concerns about nVidia IPR hidden in the CUDA APIs or programming model?

seibert · December 16, 2009, 7:49pm

Wow, I love you guys! I am going to try this out tonight on our code. (I have an algorithm that is super-fast in CUDA, but I can’t get the CPU version very fast at all.)

Gregory_Diamos · December 16, 2009, 7:49pm

Everything was done using publicly available documentation. We do not use any portion of the Open64 compiler or any other tool released by NVIDIA. We reimplemented the CUDA runtime from scratch using only the CudaReferenceManual as a guide.

Jimmy_Pettersson · December 18, 2009, 9:48pm

Very interesting! Will definitely have a try!

I have another, off-topic question: were you involved in creating the GPU VSIPL++ package that I believe originated from Georgia Tech?

I would be interested in seeing some of your performance numbers for FIR (TDFIR) filters for large data sizes. I want to see how my implementation compares :)

Gregory_Diamos · December 19, 2009, 4:59am

I personally wasn’t involved in the development of GPU VSIPL, but Andrew Kerr, the other main contributor for Ocelot, worked on GPU VSIPL. I’ll forward your question to him.

tmurray · December 19, 2009, 5:28am

Is this better than compiling with -deviceemu? :P

Very much looking forward to trying this out. Thanks for all your work.

Gregory_Diamos · December 19, 2009, 8:09am

It should be significantly faster, especially for programs with a large number of threads. The current version has a context switch overhead of roughly 10-20 cycles between threads in the same CTA; slow switching was, I think, the main problem with deviceemu. You also don’t have to recompile your program to switch between execution on a CPU and a GPU.

On the other hand, you won’t be able to call printf from within a kernel. :)
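
For readers wondering why threads in the same CTA need a context switch at all on a CPU: when one host thread runs a whole CTA, no logical thread may pass a barrier until every thread has reached it. The C++ sketch below illustrates that idea only. It is not how Ocelot is implemented (Ocelot works on PTX and generates native code), it does not model the cycle counts quoted above, and `ThreadContext`, `runUntilBarrier`, and `runCta` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread state that must survive a context switch.
struct ThreadContext {
    std::size_t tid = 0;
    int resumePoint = 0;  // which segment of the kernel to run next
    int mine = 0;         // a live value preserved across the barrier
    bool done = false;
};

enum class Status { AtBarrier, Done };

// Example kernel, split at its single barrier:
//   segment 0: publish a value to shared memory, then hit the barrier
//   segment 1: combine the saved value with another thread's shared value
Status runUntilBarrier(ThreadContext& t, std::vector<int>& shared,
                       std::vector<int>& out) {
    switch (t.resumePoint) {
    case 0:
        t.mine = static_cast<int>(t.tid) + 1;
        shared[t.tid] = t.mine;
        t.resumePoint = 1;
        return Status::AtBarrier;  // stands in for __syncthreads()
    default:
        out[t.tid] = t.mine * 100 + shared[shared.size() - 1 - t.tid];
        return Status::Done;
    }
}

// Run one CTA on one host thread: visit the logical threads in turn and
// switch to the next one whenever the current one reaches a barrier.
std::vector<int> runCta(std::size_t ctaSize) {
    std::vector<ThreadContext> threads(ctaSize);
    std::vector<int> shared(ctaSize), out(ctaSize);
    for (std::size_t i = 0; i < ctaSize; ++i) threads[i].tid = i;

    std::size_t finished = 0;
    while (finished < ctaSize) {
        for (auto& t : threads) {
            if (t.done) continue;
            if (runUntilBarrier(t, shared, out) == Status::Done) {
                t.done = true;
                ++finished;
            }
        }
    }
    return out;
}
```

Each pass over `threads` is one round of context switches. The cheaper it is to save and restore a `ThreadContext`, the less a kernel with many threads per CTA pays for being serialized on one core.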

tmurray · December 19, 2009, 8:49am

Does your CPU path support zero-copy? Could it run cuPrintf?

Gregory_Diamos · December 19, 2009, 9:22am

The CPU path does support zero-copy, although we don’t have any regression tests more complicated than the simpleZeroCopy SDK example. I haven’t looked at cuPrintf in enough detail to say whether or not it would work, but as long as it only uses CUDA API calls internally, it should work.

Jimmy_Pettersson · December 19, 2009, 4:01pm

thanks!

tmurray · December 23, 2009, 7:13pm

Oh hey, this is on Slashdot now. Good job!

Domokoen · December 23, 2009, 8:33pm

You list the library “rt” as a dependency. What library is this? It is hard to look for, as “rt” is quite common in package names/on the internet. Or is this library already installed by default?

Gregory_Diamos · December 24, 2009, 2:23am

These are the real-time extensions to Linux. Almost all flavors of Linux that I am aware of have support for this.

Gregory_Diamos · December 24, 2009, 2:56am

Thanks, I was surprised that it went through. Hopefully this generates some more interest in CUDA and Ocelot.

Gregory_Diamos · December 28, 2009, 7:10pm

UPDATE: There is a tech report available describing the implementation: http://www.cercs.gatech.edu/tech-reports/tr2009/abstracts/18.html

as well as some preliminary performance numbers: http://www.gdiamos.net/files/cpusAndGpus.png (log scale warning)

Nehalem: Intel Core i7 920
Phenom: AMD Phenom 9550
Atom: Intel Atom N270

tmurray · December 28, 2009, 9:11pm

Hmmm, I’m sure I’ve asked this before, but Ocelot does support the driver API too, right?

tmurray · December 28, 2009, 10:13pm

P.S. That was a good paper. If you’re still going to work on Ocelot, I might make a few completely ridiculous feature requests…

Jimmy_Pettersson · December 28, 2009, 10:41pm

CUDA for GPUs, FPGAs, and now CPUs :)

It’s -24 C outside so this paper can be a good holiday diversion!

Gregory_Diamos · December 29, 2009, 1:05am

There is currently no driver-level API implementation in Ocelot. It would be possible to add one in the future without too much effort (the implementation of the CUDA runtime is about 2-3k lines and I wouldn’t expect the driver-level API to be much more complex than this), but we don’t have anyone actively working on it.
