Ocelot 1.0 Alpha Release High Performance GPU and Multi-core CPU targets

Everything was done using publicly available documentation. We do not use any portion of the Open64 compiler or any other tool released by nvidia. For the CUDA runtime, we completely reimplemented it from scratch using only the CudaReferenceManual as a guide.

very interesting! will definetly have a try!

I have another, off-topic question; where you involved in creating the GPU VSIPL++ package that i believe originated from Georgia tech?

I would be interested in seeing some of your performance numbers for FIR (TDFIR) filters for large datasizes. I want to see how my implementation compares :)

I personally wasn’t involved in the development of GPU VISPL, but Andrew Kerr, the other main contributor for Ocelot, worked on GPU VSIPL. I’ll forward your question to him.

Is this better than compiling with -deviceemu? :P

Very much looking forward to trying this out. Thanks for all your work.

It should be significantly faster, especially for programs with a large number of threads. The current version has about a ~10-20 cycle context switch overhead between threads in the same CTA, which I think was the main problem with deviceemu. You also don’t have to recompile your program to change from execution on a CPU vs a GPU.

On the other hand, you won’t be able to call printf from within a kernel. :)

Does your CPU path support zero-copy? Could it run cuPrintf?

The CPU path does support zero-copy, although we don’t have any regression tests more complicated than the simpleZeroCopy SDK example. I haven’t looked at cuPrintf in enough detail to say whether or not it would work, but as long as it only uses CUDA API calls internally, it should work.


Oh hey, this is on Slashdot now. Good job!

You list the library “rt” as a dependency. What library is this? It is hard to look for, as “rt” is quite common in package names/on the internet. Or is this library already installed by default?

These are real-time extensions to linux. Almost all flavors of linux that I am aware of have support for this.

Thanks, I was surprised that it went through. Hopefully this generates some more interest in CUDA and Ocelot.

UPDATE: There is a tech report available describing the implementation: http://www.cercs.gatech.edu/tech-reports/t…stracts/18.html

as well as some preliminary performance numbers: http://www.gdiamos.net/files/cpusAndGpus.png (log scale warning)

Nehalem: Intel Core i7 920
Phenom: AMD Phenom 9550
Atom: Intel Atom N270

Hmmm, I’m sure I’ve asked this before, but Ocelot does support the driver API too, right?

ps–that was a good paper. If you’re still going to work on Ocelot, I might make a few completely ridiculous feature requests…

CUDA for GPUs, FPGAs, and now CPUs :)

It’s -24 C outside so this paper can be a good holiday diversion!

There is not currently a driver level api implementation in Ocelot. It would be possible to add one in the future without too much effort (the implementation of the CUDA runtime is about 2-3k lines and I wouldn’t expect the driver level api to be much more complex than this), but we don’t have anyone actively working on it.

Feel free to make any suggestions, we would welcome any input that you have.

I am planning on working directly on Ocelot for the remainder of my time at Georgia Tech (1-1.5 years). I think that Andrew Kerr, the other main contributor is as well. We are also starting a few side projects next semester that will add be headed by other PhD students working on CUDA-related topics that will add features to Ocelot.

Nice to see your project improving rapidly…

Just a few questions and remarks:

Back when you were working on the Cell backend, you generated SIMD instructions and handled branch divergence in software, right? Do you plan to do so with this translator?

I think LLVM supports vector instructions and registers.

Since most CUDA codes are data-parallel programs already optimized for SIMD execution, and the hardware industry is heading for general-purpose cores with wide SIMD extensions, I believe that makes an interesting research direction (and just figuring the best way to implement branches and predication should keep a few PhD students busy for some time ;)).

You observe that strided accesses are much slower than sequential accesses on the CPU. Do you think it would be possible to detect at least some coalesced memory accesses in the PTX code through static analysis, and then translate them into sequential/vector loads and stores on the CPU side?

I don’t think your implementation of rounding works as it stands. Think of what happens at a midpoint between two integers. Also, cvt.rni.f32.f32 need to work with big numbers too. My suggestion is to implement all conversions as library functions based on nearbyint() and lrint(), as you already do in the emulator, and list that among the “supported in hardware on the GPU but not the CPU” stuff.

What was the range of the random inputs for the special function throughput benchmarks?

Interesting that rsqrt ends up being faster than sqrt even for scalar code…

In my opinion, the ultimate CUDA->CPU translator should:

  • take advantage of SIMD instructions when possible and efficient and select the appropriate SIMD width,

  • figure out memory access patterns to emit the most efficient memory instructions,

  • provide a target-dependent library of data-parallel functions such as reduction and scan, math functions and such. I think not allowing this is the most prevalent limitation of PTX at the moment.

  • interleave instructions from various threads to reduce pipeline hazards
  • use the FP_MUL , FP_ADD exec units simulataneously by scheduling the instructions smartly.