Ocelot 1.0 Alpha Release High Performance GPU and Multi-core CPU targets

Feel free to make any suggestions, we would welcome any input that you have.

I am planning on working directly on Ocelot for the remainder of my time at Georgia Tech (1-1.5 years), and I think that Andrew Kerr, the other main contributor, is as well. We are also starting a few side projects next semester, headed by other PhD students working on CUDA-related topics, that will add features to Ocelot.

Nice to see your project improving rapidly…

Just a few questions and remarks:

Back when you were working on the Cell backend, you generated SIMD instructions and handled branch divergence in software, right? Do you plan to do so with this translator?

I think LLVM supports vector instructions and registers.

Since most CUDA codes are data-parallel programs already optimized for SIMD execution, and the hardware industry is heading for general-purpose cores with wide SIMD extensions, I believe that makes an interesting research direction (and just figuring out the best way to implement branches and predication should keep a few PhD students busy for some time ;)).

You observe that strided accesses are much slower than sequential accesses on the CPU. Do you think it would be possible to detect at least some coalesced memory accesses in the PTX code through static analysis, and then translate them into sequential/vector loads and stores on the CPU side?

I don’t think your implementation of rounding works as it stands. Think of what happens at a midpoint between two integers. Also, cvt.rni.f32.f32 needs to work with big numbers too. My suggestion is to implement all conversions as library functions based on nearbyint() and lrint(), as you already do in the emulator, and list that among the “supported in hardware on the GPU but not the CPU” stuff.

What was the range of the random inputs for the special function throughput benchmarks?

Interesting that rsqrt ends up being faster than sqrt even for scalar code…

In my opinion, the ultimate CUDA->CPU translator should:

  • take advantage of SIMD instructions when possible and efficient and select the appropriate SIMD width,

  • figure out memory access patterns to emit the most efficient memory instructions,

  • provide a target-dependent library of data-parallel functions such as reduction, scan, and math functions. I think that not allowing this is the most significant limitation of PTX at the moment,

  • interleave instructions from various threads to reduce pipeline hazards,

  • use the FP_MUL and FP_ADD execution units simultaneously by scheduling instructions smartly.

That’s great!! :-)

BTW what is the reason behind the name Ocelot…

We are currently working on this, but it is not at all straightforward what the best approach here would be. Should all nested branches be completely unrolled and converted into predication? What should the warp size be? If there is a divergent branch, should inactive threads be handled using predication, or by executing a different version of the program that has a narrower SIMD width?

I think that it would be possible to detect coalesced accesses where a single variable was directly derived from a special register (thread id) that was used as an offset to a memory access. Detecting them in the general case would be much more difficult. Once an access has been classified as ‘strided’, converting it to a sequential access on the CPU would be difficult because each data element must end up in the corresponding thread. Off the top of my head, it seems like it would be possible to add a context switch point immediately before and after the access. This would cause the threads to execute their accesses one after another in a sequence, but it would introduce some context switch overhead. Most of the context switch overhead could probably be mitigated by only selectively saving and restoring registers that were needed for the access.

It absolutely is not bit-accurate (or even IEEE compliant). The idea behind the current implementation was to provide something that compromised between accuracy and performance. There are portable ways to change the default behaviour of the floating point units on CPUs, as we do in the emulator, but it would require a library call before and potentially after instructions that use a non-default rounding mode. I thought that this overhead was not justified for an operation that could be mapped to a single x86 instruction. Out of all the applications that we tested, none of the results were affected except for the SDK dxtc example, where about 10 out of 307200 pixels were off by one RGB value. It would be relatively simple to add support for both modes and have a configuration option select which mode is used. I’ll add that as a feature request.

Going back and looking in detail, the range was (0.0-1.0]. I think that both of these functions use piecewise polynomial approximation, so the range of inputs is very important. I should probably re-run the experiment with a wider range, or at least with different sub-ranges.

I completely agree with these. SIMD support is very high on our priority list right now. Memory accesses are important as well, though I think we need to think about the problem in more detail before we can implement anything.

I think that this would have to be handled at a higher level than PTX as there are many different ways to implement scan or reduction, even on the same hardware. For example, a library like thrust or CUPP providing a reduction function could query the device type before selecting an implementation.

No significance at all. I personally prefer names that are memorable and are known widely enough such that most people will have heard them before rather than arduously contrived acronyms.

We have certainly thought about this, but it creates a trade-off between register file pressure (x86 processors are relatively register-starved) and pipeline hazards. This is an optimization that most compilers completely ignore, though, so there is still room for exploration.

The LLVM code generator handles instruction scheduling for us. I am hoping that it does things like this.

A first step would be to implement it just as it’s done on current GPUs: with opportunistic branching when the condition is uniform, and predication otherwise. I believe a simple implementation based on activity counters [ref] should already perform well, at least for CUDA apps optimized for low divergence.

Selecting the best warp size is another problem, but once it’s implemented it should just be a matter of changing a parameter.

Rounding is needed in 3 cases:

  • Arithmetic operations. The default rounding mode is round-to-nearest-even on both the GPU and the CPU, so there is no overhead in this case. I would assume that if the programmer went through the trouble of using intrinsics to select a rounding mode other than the default, then they really mean it, and it should be implemented as such.

Anyway, I would be very interested in hearing about a CUDA/OpenCL app that uses a non-default rounding mode for arithmetic operations (I’m not just being sarcastic, I’m actually interested.)

  • Conversion from float to int, rounded to the nearest.

Fortunately, this maps to a single x86 instruction. Unfortunately, there’s no such instruction in LLVM. But the rint(f) function of C99 does just that and is usually implemented as inline assembly (hopefully inlined). As far as I know such conversion is used in CUDA only inside the range reduction code of trigonometric functions from the math library.

  • C-style cast from float to int, rounded toward zero.

Fortunately, this maps to a single instruction in LLVM. Unfortunately, x86 implementations (like this for instance) are terrible :) (at least before SSE3). But this is LLVM’s problem, not yours…

So in my opinion there is no need for a configuration option or to try to cheat, as long as doing things correctly doesn’t cause any benchmark to run slower… ;)

That’s OK, it probably won’t change the result unless you select very large numbers (for trig functions).

The reason why rsqrt (most likely implemented using Newton-Raphson iteration) is faster than sqrt (supported by the hardware divider) is probably because the latter is not pipelined. So sqrt has a better latency, but worse throughput, and software beats hardware…

Maybe a better example is the math library. Currently, if I want efficient code for both the Tesla and Fermi architectures, I need to make nvcc generate two separate PTX implementations (because Fermi has a single-precision FMA and Tesla doesn’t, so the optimal implementation of math functions is pretty different). Now if I want to target x86 CPUs, I’ll need yet another PTX code obtained by linking with yet another math library… Doesn’t that defeat the purpose of a retargetable intermediate language?

… and those who haven’t heard them before will look it up on Wikipedia and improve their general knowledge. I’m delighted to know that Salvador Dalí used to have a pet ocelot. :)