I was trying to see whether I have reached a logical compute ceiling for one of my kernels. I started by looking at the unoptimized PTX code and isolated the part that executes most often, usually thousands of times per thread per kernel call. The overhead of the rest of the kernel is negligible compared to the execution time of this part. I should also mention that the sequence I'm looking at is part of an unrolled loop. This is rather long, so please bear with me.
This is one iteration of the unoptimized PTX output of nvcc:
ld.shared.f32 %f14, [%r32+112];
ld.shared.f32 %f15, [%r32+116];
ld.shared.f32 %f16, [%r32+120];
ld.shared.f32 %f17, [%r32+124];
mov.f32 %f35, 0f5005ecca; // 8.98755e+09
mul.f32 %f19, %f17, %f35;
sub.f32 %f20, %f9, %f15;
sub.f32 %f21, %f8, %f14;
sub.f32 %f22, %f3, %f16;
mul.f32 %f23, %f20, %f20;
mad.f32 %f24, %f21, %f21, %f23;
mad.f32 %f25, %f22, %f22, %f24;
sqrt.approx.f32 %f26, %f25;
mul.f32 %f27, %f25, %f26;
div.approx.f32 %f28, %f19, %f27;
mad.f32 %f7, %f21, %f28, %f7;
mad.f32 %f6, %f20, %f28, %f6;
mad.f32 %f5, %f22, %f28, %f5;
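For reference, here is the kind of CUDA C this looks like it came from: an electrostatic (Coulomb) force accumulation over a tile of particles held in shared memory. This is my reconstruction, not the actual source; every name in it is made up, and it assumes 16-byte particle records, so the +112 offset would correspond to particle j = 7 of the tile.

#define K_E 8.98755e9f  // Coulomb's constant; matches the 0f5005ecca literal

// Hypothetical reconstruction: shQ packs one particle per four floats
// (x, y, z, charge); xi/yi/zi and the fx/fy/fz accumulators live in registers.
__device__ void accumulate_force(const float* shQ, int j,
                                 float xi, float yi, float zi,
                                 float& fx, float& fy, float& fz)
{
    float xj = shQ[4 * j + 0];              // the four ld.shared.f32
    float yj = shQ[4 * j + 1];
    float zj = shQ[4 * j + 2];
    float qj = shQ[4 * j + 3];

    float kq = qj * K_E;                    // mov.f32 + mul.f32
    float dx = xi - xj;                     // three sub.f32
    float dy = yi - yj;
    float dz = zi - zj;
    float r2 = dx * dx + dy * dy + dz * dz; // mul.f32 + two mad.f32
    float r  = sqrtf(r2);                   // sqrt.approx.f32 under fast math
    float s  = kq / (r2 * r);               // kq / r^3: mul + div.approx.f32
    fx += dx * s;                           // three mad.f32 into accumulators
    fy += dy * s;
    fz += dz * s;
}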
To simplify calculations, I will be looking at the number of cycles needed to execute an instruction from the viewpoint of a single thread, rather than of the entire warp as given in the programming guide. For a regular arithmetic instruction that means 1 cycle per thread: the full 32-thread warp needs 4 cycles on the multiprocessor's 8 SPs, but each thread's operation occupies an SP for just one cycle.
We have 18 floating point operations, or FLOP, counting each mad as two: 3 sub + 3 mul + 5 mad * 2 + 1 sqrt + 1 div = 18.
Looking at the shared memory loads, we see that we have 4-way bank conflicts, so each access should take four clock cycles, or 16 cycles for the four loads.
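For what it's worth, a 4-way conflict is exactly what a 16-byte-per-record layout produces on this hardware (16 banks of 32-bit words, conflicts resolved per half-warp), assuming each thread of a half-warp reads its own record. A quick host-side sanity check of the bank mapping:

#include <stdio.h>

// Each thread of a half-warp reads word 4*tid (the first float of its
// 16-byte record); bank = word index mod 16 on compute 1.x hardware.
int main(void)
{
    int hits[16] = {0};
    for (int tid = 0; tid < 16; ++tid)
        ++hits[(4 * tid) % 16];
    for (int b = 0; b < 16; ++b)
        printf("bank %2d: %d accesses\n", b, hits[b]);
    return 0;
}

Four threads land on each of banks 0, 4, 8 and 12, so every load is serialized four ways, which is where the four cycles per access come from.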
According to the programming guide, the throughput of square root is one instruction per cycle for a warp, or 8 clock cycles from the point of view of a thread. And since division is implemented as multiplication by the reciprocal, a thread sees 4 cycles for the reciprocal plus one cycle for the multiplication, 5 cycles in total.
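In source terms, the decomposition I'm assuming looks like the following; this is a sketch of what the guide describes, not actual disassembly:

// Sketch: with -use_fast_math, a float division becomes div.approx.f32,
// which per the guide amounts to a reciprocal followed by a multiply.
__device__ float approx_div(float a, float b)
{
    float r = 1.0f / b;   // should lower to rcp.approx.f32 (4 cycles/thread)
    return a * r;         // plus one mul.f32 (1 cycle/thread)
}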
Thus, in theory we should have 16 + 8 + 5 + 12 = 41 cycles to perform the 18 floating point operations: 16 for the loads, 8 for the sqrt, 5 for the division, and 12 for the constant mov plus the 11 remaining arithmetic instructions. Two 9800GTs will execute 2 * 1.62 * 112 = 362.88 Gigacycles per second; if for every 41 cycles we get 18 floating point operations, throughput should be 362.88 * 18 / 41 = 159 GFLOP/s. Even if the constant load is optimized out (40 cycles), we should still expect only 163 GFLOP/s.
However, the measured performance of the code in this configuration is 408 to 425 GFLOP/s. That is equivalent to performing the 18 floating point operations in only (get ready for weird) 15.5 to 16 clock cycles. I’ve tried replacing both the square root and the division with multiplications, in which case performance reached about 495 GFLOP/s, or around 13 clock cycles for the 18 operations.
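All the cycle equivalences here and below come from inverting the throughput formula above. A quick check, in plain host code, using the same clock and SP figures as before:

#include <stdio.h>

// cycles per iteration = (total Gcycles/s) * FLOP / (measured GFLOP/s)
int main(void)
{
    const double gcycles = 2 * 1.62 * 112;  // two 9800GTs, 1.62 GHz, 112 SPs each
    const double flop = 18.0;
    const double rates[] = {408, 425, 458, 495};  // measured GFLOP/s
    for (int i = 0; i < 4; ++i)
        printf("%g GFLOP/s -> %.2f cycles\n", rates[i], gcycles * flop / rates[i]);
    return 0;
}

This prints 16.01, 15.37, 14.26, and 13.20 cycles respectively.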
This takes me to my first assumption. We have exactly 13 instructions that perform meaningful floating point arithmetic. In the case where we replace the sqrt and division with multiplications, we need exactly 13 clock cycles to do the work. This means that "mov.f32 %f35, 0f5005ecca;" and the shared loads must have been optimized out. In other words, since all the data is already loaded into registers, the assembler has optimized the code by making a thread perform operations using registers that "belong" to other threads. This is theoretically feasible since the array in question is of constant size (512 bytes) and fits in the register space.
The second question relates to exactly how many clock cycles the division and square root actually take. Converting just one of the two (either the square root or the division) to a multiplication brings performance to 458 GFLOP/s, the equivalent of executing the code in 14.25 cycles. So the square root and the division each take an extra 1.25 cycles when compared to regular arithmetic instructions. That indeed accounts for the extra 2.5 to 3 cycles we’re seeing over the 13-cycle case, but in no way explains why div and sqrt execute so quickly: 2.25 clock cycles as opposed to the expected 5 or 8 respectively. For the entire warp, executing a sqrt or div takes only 9 clock cycles. I find these numbers a little "disturbing".
This takes me to the questions I’m trying to ask:
a ) Is the compiler smart enough, and is it feasible hardware-wise, to make threads access registers "owned" by other threads?
b ) What other optimizations are happening that make division and square root much faster than their specified speed?
I’m asking these questions because these are the most aggressive optimizations I have ever witnessed. In fact, the code performs faster than the logical compute ceiling (if we are to strictly follow the programming guide); the only other theoretical optimization would be to make sqrt and div execute in a single clock cycle. Intel has had 30+ years to twiddle its thumbs with x86 and is nowhere near being halfway capable of such efficiency.