Looking for a logical compute ceiling, found magic CUDA optimizations

I was trying to see if I had reached a logical compute ceiling for one of my kernels. I started by looking at the unoptimized PTX code and isolated the part that executes most often, usually thousands of times per thread per kernel call. The overhead of the rest of the kernel is negligible compared to the execution time of this part. I should also mention that the sequence I’m looking at is part of an unrolled loop. This is rather long, so please bear with me.

This is one iteration of the unoptimized PTX output of nvcc:

ld.shared.f32    %f14, [%r32+112];
ld.shared.f32    %f15, [%r32+116];
ld.shared.f32    %f16, [%r32+120];
ld.shared.f32    %f17, [%r32+124];
mov.f32          %f35, 0f5005ecca;    // 8.98755e+09
mul.f32          %f19, %f17, %f35;
sub.f32          %f20, %f9, %f15;
sub.f32          %f21, %f8, %f14;
sub.f32          %f22, %f3, %f16;
mul.f32          %f23, %f20, %f20;
mad.f32          %f24, %f21, %f21, %f23;
mad.f32          %f25, %f22, %f22, %f24;
sqrt.approx.f32  %f26, %f25;
mul.f32          %f27, %f25, %f26;
div.approx.f32   %f28, %f19, %f27;
mad.f32          %f7, %f21, %f28, %f7;
mad.f32          %f6, %f20, %f28, %f6;
mad.f32          %f5, %f22, %f28, %f5;
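
For reference, one iteration corresponds to roughly the following C (a simplified sketch reconstructed from the PTX; the variable names and the shared-array layout are illustrative, not the actual source):

// Rough reconstruction of one unrolled iteration (illustrative names only)
// sh[] is a shared float array holding x, y, z, q for each element, 4 floats apart
float xj = sh[4 * j + 0];                // the four ld.shared.f32
float yj = sh[4 * j + 1];
float zj = sh[4 * j + 2];
float kq = sh[4 * j + 3] * 8.98755e9f;   // mov + mul (Coulomb's constant)
float dx = xi - xj;                      // the three sub.f32
float dy = yi - yj;
float dz = zi - zj;
float r2 = dx * dx + dy * dy + dz * dz;  // mul + two mad.f32
float r  = sqrtf(r2);                    // sqrt.approx.f32
float s  = kq / (r2 * r);                // mul + div.approx.f32, i.e. kq / r^3
fx += dx * s;                            // the three accumulating mad.f32
fy += dy * s;
fz += dz * s;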

To simplify calculations, I will be looking at the number of cycles needed to execute an instruction from the viewpoint of a single thread, rather than of the entire warp as given in the programming guide.

We have 18 floating point operations (FLOPs).

Looking at the shared memory loads, we see that we have 4-way bank conflicts, so each access should take four clock cycles.

According to the programming guide, the throughput of square root is one instruction per cycle for a warp, or, from the point of view of a thread, 8 clock cycles. Also, since division is implemented as multiplication by the reciprocal, from the viewpoint of a thread we have 4 cycles for the reciprocal and one cycle for the multiplication, for a total of 5 cycles.

Thus, in theory we should have 16 + 8 + 5 + 12 = 41 cycles to perform 18 floating point operations. Two 9800GTs will execute 2 * 1.62 * 112 gigacycles per second; if for every 41 cycles we have 18 floating point operations, we should get a throughput of 2 * 1.62 * 112 * 18 / 41 = 159 GFLOP/s. Even if the constant load is optimized out, we should still expect only 163 GFLOP/s.
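
Spelling that arithmetic out in one place (just the numbers above, nothing new; the 112 is the SP count of a single 9800GT):

// Back-of-the-envelope ceiling, using the cycle counts above (a sketch, not a benchmark)
double cycles_per_iter = 16 + 8 + 5 + 12;               // loads + sqrt + div + rest = 41
double gflops          = 2 * 1.62 * 112 * 18 / cycles_per_iter;  // ~159 GFLOP/s
double gflops_no_mov   = 2 * 1.62 * 112 * 18 / 40.0;             // ~163 GFLOP/s with the mov optimized out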

However, the performance of the code in this configuration is 408 to 425 GFLOP/s. That is equivalent to the 18 floating point operations performed in only (get ready for weird) 15.5 to 16 clock cycles. I’ve tried replacing both the square root and division by multiplications, in which case performance reached about 495 GFLOP/s, or around 13 clock cycles to do the 18 operations.

This takes me to my first assumption. We have exactly 13 instructions that perform meaningful floating point arithmetic. In the case where we replace the sqrt and division with multiplications, we need exactly 13 clock cycles to do the work. This means that “mov.f32 %f35, 0f5005ecca;” and the shared loads must have been optimized out. In other words, since all the data is already loaded into registers, the assembler has optimized the code by making a thread perform operations using registers that “belong” to other threads. This is theoretically feasible since the array in question is of constant size (512 bytes) and fits in the register space.

The second question relates to exactly how many clock cycles it takes for the division and square root to execute. By converting either the square root or the division to a multiplication, performance jumps to 458 GFLOP/s, or the equivalent of executing the code in 14.25 cycles. So both the square root and division take an extra 1.25 cycles to execute when compared to regular arithmetic instructions. This indeed accounts for the extra 2.5 to 3 cycles that we’re seeing, but in no way explains why div and sqrt execute so quickly, in 2.25 clock cycles as opposed to 9 or 8 respectively. For the entire warp, executing a sqrt or div takes only 9 clock cycles. I find these numbers a little “disturbing”.

This takes me to the questions I’m trying to ask:

a ) Is the compiler smart enough, and is it feasible hardware-wise to make threads access registers “owned” by other threads?

b ) What other optimizations are happening that make division and square root much faster than their specified speed?

I’m asking these questions because these are the most aggressive optimizations I have ever witnessed. In fact, the code performs faster than the logical compute ceiling (if we are to strictly follow the programming guide), the only other theoretical optimization being to make sqrt and div execute in one clock cycle. Intel has had 30+ years to twiddle its thumbs with x86 and is nowhere near being halfway capable of such efficiency.

The throughput numbers you cite are per warp, not per CUDA core. As each multiprocessor executing 1 warp = 32 threads has only 8 cores, the throughput of your 9800GT cards is 4 times what you thought.

Also, the special function units operate in parallel to the FP units, so that the 2 special functions used in your example hide the cycles used for 2 FP operations.

On the other hand, some instructions take more cycles than you calculated. Loads from shared memory take 6 cycles (per warp).
Special functions take 16 cycles (per warp). A square root is compiled to two of them (inverse sqrt and div, as in your example), or 32 cycles per warp. This is where the 1 cycle per thread that you cited comes from.

I seem to have not made the distinction between ‘instructions per cycle’ and ‘cycles per instruction’ very clear. Allow me to explain.

The data I cite is converted to per-thread from the per-warp numbers. If we have 8 instructions per cycle for the warp, the MP needs 32/8 = 4 cycles to generate 32 instructions for the 32 threads in a warp. That is one physical cycle corresponding to each instruction, or a 1-to-1 ratio from the point of view of a thread. Something that executes at a rate of 2 instructions per cycle takes 4 times as long, or, if we run the calculation above, in our abstract units of clock cycles for a thread we get 4. And so on…
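
To make that bookkeeping concrete, here is a small helper of my own (not anything from the guide) that expresses the convention:

// Converts the per-warp issue rate into the "abstract thread cycles" used above.
// 4 physical cycles to issue a warp at 8 instructions/cycle counts as 1 thread cycle.
float abstract_thread_cycles(float instructions_per_cycle)
{
    float warp_cycles = 32.0f / instructions_per_cycle;  // cycles to issue one instruction for a full warp
    return warp_cycles / 4.0f;                           // normalize: 4 warp cycles == 1 abstract thread cycle
}
// abstract_thread_cycles(8.0f) == 1, abstract_thread_cycles(2.0f) == 4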

The programming guide states that the special function units are used for transcendentals; other than that, there’s not much information on the SFUs. Since reciprocal, inverse square root, and square root are not transcendentals, I assumed they were handled by the SPs. Assuming the SFUs do handle them, and that an SFU takes one clock cycle to execute any of those instructions: with 2 SFUs, we need 16 warp cycles to generate the instructions for the entire warp. In our abstract per-thread clock cycles that comes to 4 (just what I calculated initially).

Where did you get that number?

32 (threads) / 16 (warp cycles) = 2 instructions per cycle, or 4 abstract thread cycles per instruction.

According to the programming guide,

“[font=“Garamond”][font=“Garamond”]Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication, so that it gives correct results for 0 and infinity. Therefore, its throughput is 1 operation per clock cycle.[/font][/font]”

That comes to the 8 abstract thread cycles per instruction that I cited.

Anyway, I see now how the execution of other warps hides the abstract thread clock cycles from measurement. This should explain why I’m seeing the same latency for both div and sqrt. It simply means that the SFUs are used too often for there to be enough regular instructions to cover the abstract latencies. That means I have indeed reached a logical theoretical compute ceiling.

Of course, this still leaves question ‘a’ open.

Hmm… I think your results might be best explained by the fact that the SFU is completely parallel to the rest of the SP. This means that the MP can actually perform 9 FLOPs per clock. Since you’re doing 2 SFU ops and 16 SP ops, you should get perfect utilization of the MP, which should have a throughput of 1/4th of a warp every 16 cycles for 18 FLOPs per thread. This is the “missing mul” everyone talks about.

It might also be able to fuse some of those sub’s and mul’s into mad’s, depending on whether it can flip the sign of any of the appropriate registers. And yes, it could very well optimize out the shared loads, if it can determine that nothing is writing to that memory space in that timespan.

Ah ok, if you prefer those units.

I assume they are done on the SFUs because they are a perfect fit (I seem to remember that they do all operations by piecewise quadratic approximation, but I have no reference at hand). The adder and multiplier network of the SP, on the other hand, is of little use for doing a fast division or inverse square root. Of course, this is all guesswork as it is not well documented.

It’s in Vasily Volkov’s paper “Benchmarking GPUs to Tune Dense Linear Algebra” (see his website). It also matches my own measurements.

Yes, we agree here.

As far as I know, current hardware is not able to do any “horizontal” operations, e.g. access other threads’ registers. This would require additional data paths that are just not present in the hardware (I’d love to be proven wrong here and learn how to do it, but I’m pretty convinced of this, as it would consume quite some die area and be of zero use for graphics).

Wait, it’s two special function units per SM. In the very rare case that all of them are occupied at all times, you get a peak throughput of (8 * 2 FLOPs (mad) + 2 FLOPs (sfu)) * nSM * clock; that comes to 408 GFLOP/s for a 9800GT at 1.62 GHz.
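
Plugging in the numbers (nSM = 14 is my assumption for a 9800GT, i.e. 112 SPs / 8 SPs per MP):

// Peak with both SFUs busy every cycle (a sketch; nSM = 14 assumed for a 112-SP 9800GT)
double peak_gflops = (8 * 2 + 2) * 14 * 1.62;   // (8 MADs + 2 SFU FLOPs) * nSM * GHz ≈ 408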

Not in this case though; I checked both the C code and the PTX code and it’s not doable. I was able to change the division and sqrt to a multiplication by the reciprocal square root. Absolute peak throughput is 464 GFLOP/s, which is close to 18 FLOPs in 14 clock cycles, so I think I finally reached the ceiling.

It’s what my measurements indicate.

I agree. You could reuse most of the transistor logic to do transcendentals, rsqrt, reciprocal, etc. It makes sense to only have two such units because of their complexity, yet include eight simpler SPs.

I did not mean to say the hardware does operations with registers from a different SP, but rather with registers on the same SP that have been written by other threads, in an awkwardly similar manner to shared memory.

Of course, according to the programming guide:

"[font=“Garamond”][font=“Garamond”]each scalar thread executes independently with its own instruction address and

register state."

[/font][/font]If the statement is from the point of view of the hardware, then it shouldn’t be possible to do that, but if it’s from the point of view of the programming model, there’s nothing immediately obvious that prevents these optimizations in hardware. I do not know what happens, but I would bet a hefty sum that those shared load instructions from the PTX code do not execute.

On a side note, it’s interesting that the compiler/assembler do heavy-duty low-level optimizations, but they fail to detect easy-to-optimize higher-level constructs, such as division by a product that contains a sqrt, like the following: a / (b * sqrt( c )), which is equivalent to the much faster (a / b) * rsqrt( c ), eliminating one reciprocal operation.
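
To make the example concrete (a, b and c here are just placeholders, not variables from my kernel):

// The rewrite the compiler misses (placeholder variables, just to illustrate)
float slow = a / (b * sqrtf(c));    // sqrt expands to rsqrt + reciprocal, and the division adds another reciprocal
float fast = (a / b) * rsqrtf(c);   // same result, one reciprocal fewer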

Nvidia usually claims that each core can do a mad in the SP and a mul in the SFU, thus three FLOPs per cycle per core, or 544 GFLOP/s for the 9800GT at 1.62 GHz. That seems to fit well with your observations.

Ah, that’s an interesting concept! Something like an out-of-bounds register access.

However, I assume that the different warps of a thread block have different, free-running program counters, as otherwise there would be little need for the syncthreads() barrier instruction. And we know that one is for real, since there are so many people running into problems with different threads waiting in different syncthreads().

Then reuse of the same data over different warps would need extra synchronization (where the compiler would have to check whether the gain is worth the extra barrier) and I somehow doubt the compiler does that (although it certainly seems doable and not very difficult actually). Speculations, speculations…

Would be interesting to see. Can you check with decuda?

I agree.

My guess would be that the split of sqrt() into rsqrt() and reciprocal happens after the high-level arithmetic optimizations, and that the optimizing “assembler” does not do those again.

On a side note, I find it interesting that similar examples often seem to be cited among the primary examples for optimization, yet compilers (at least gcc) have traditionally been hesitant to emit those.

It doesn’t add up to 3 FLOPs per cycle per SP for the entire MP. It’s at most 8 MADs on the SPs and 2 MULs on the SFUs, so that’s 18 FLOPs per MP per cycle, or 2.25 FLOPs per cycle per SP (not the 3 the marketing team liked to use). Now that’s for compute 1.0 and 1.1. According to some GT200 architecture reviews, and TMurray (can’t find the exact post), the GT200 and later can do the MAD and the MUL on the SP.

It’s definitely a good point of optimization for some algorithms. Basically, it’s a great point of optimization for any algorithm that traverses array B to compute an element of array A.

I tried and failed. Either I’m a complete noob, or decuda doesn’t like files generated with CUDA 3.0. I’m attaching the cubin file should you wish to put it through decuda.

Damn, something should look at and optimize such things. I gained a 30% speedup (110 GFLOP/s on my hardware) by reorganizing three or four basic operations. So before that, the compiler was losing more than the performance equivalent of a dual-Xeon system for not optimizing out a couple of redundant reciprocals. Maybe I should mention it on the “I’d like to do this but I can’t” thread.

Intel’s compiler is very good at doing optimizations. Up until yesterday, I was using the exact same code for the functor doing the calculations. However, moving the operands around didn’t change CPU performance at all. I remember seeing rsqrtss somewhere in the disassembled x86 code, though my code only has sqrt();

About GCC, oh, don’t get me started on that. With all due respect to anyone using GCC, anyone working on GCC, the free software community, and the Free Software Foundation, I dubbed GCC “the headache compiler”, though I admit it produces code a tad bit faster than Microsoft’s.
cubin.zip (5.04 KB)