clock cycles of double operations

Hi,

You know that sm_1.3 can support double-precision operations in kernels. My question is how many clock cycles the double operations should take, such as add, subtract, multiply, divide, sqrt, reciprocal … (I cannot find it in the programming guide; it only has the information for single-precision float)

By the way, I have another question: when I use shared memory of double type, will there be bank conflicts? For example: __shared__ double sh[64]; sh[tid] = x; …

My GPU is a GTX 280. Thanks a lot!

Minming

I believe the 64-bit FMAD works as fast as the 32-bit one.

If I understand this correctly, the way we calculate peak FLOPS for single precision is to count a multiply-add (done by the SP’s FMAD) as two operations and add to that a single MUL (performed concurrently by the SFU), thus 3 operations per cycle. With 8 SPs, that’s 24 operations per cycle per SM. Double-precision peak FLOPS is one twelfth of the single-precision figure (78G vs 933G), so that’s 2 operations per cycle per SM. Since there’s only one 64-bit unit per SM, that works out to 2 operations per cycle for this unit, which must be a single multiply-add.

If my FLOPS calculations aren’t off, 64-bit add, sub and mul should take 1 cycle per operation. Multiply-add also takes one cycle (only it’s counted as two operations by benchmark-junkies).
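For concreteness, here is the arithmetic behind those peak figures as a tiny C sketch. It takes the GTX 280’s commonly quoted 30 SMs and ~1.296 GHz shader clock as assumptions (they are not stated above), and they reproduce the 933G / 78G numbers:

#include <stdio.h>

int main(void)
{
    /* Assumed GTX 280 figures: 30 SMs, ~1.296 GHz shader clock. */
    const int    sm_count   = 30;
    const double shader_ghz = 1.296;

    /* Per SM, per cycle: 8 SPs each doing a MAD (2 flops) plus 8 SFU-issued MULs (1 flop each). */
    double sp_peak = sm_count * (8 * 2 + 8 * 1) * shader_ghz;   /* ~933 GFLOPS */

    /* Per SM, per cycle: 1 double-precision unit doing an FMAD (2 flops). */
    double dp_peak = sm_count * (1 * 2) * shader_ghz;           /* ~78 GFLOPS  */

    printf("SP peak ~ %.0f GFLOPS, DP peak ~ %.0f GFLOPS\n", sp_peak, dp_peak);
    return 0;
}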

No idea about div, sqrt, inv etc.

What you said is right, with the only correction that it takes 1 cycle per thread. Normally one evaluates the number of clock cycles needed for the entire warp, hence it is 32 clock cycles for double-precision ops (as opposed to 4 clock cycles for single-precision arithmetic, because there are 8 SPs per SM).
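Spelling that out (taking the standard 32-thread warp size as given):

    32 threads per warp / 8 SPs per SM     = 4 cycles per warp per single-precision instruction
    32 threads per warp / 1 DP unit per SM = 32 cycles per warp per double-precision instruction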

So 32 vs 4 cycles. Does that mean double operations are 8 times slower than float operations?

The double-precision unit is as fast as the single-precision unit in that it can do one operation per cycle per thread (two, if you count a MAD as 2 ops). But there are 8 single-precision units and only 1 double-precision unit per SM, so parallel single-precision calculation should be 8 times faster.

Also, single-precision computation can theoretically be made concurrent with the use of the SFU, giving another MUL per cycle per thread. AFAIK, this mostly happens when your code looks somewhat like this (and if the compiler is in a good mood):

for (;;)  // for ever
{
    a = a + b*c;  // FMAD, this saturates the SPs
    e = e*f;      // MUL, this goes to the SFU, a "free MUL"
}

Such code is surprisingly common in graphics, less so in more general programming. But on such rare occasions, you theoretically get 12 times more oomph out of the 8 SPs and the SFU than from a single, lonely DP unit. The DP unit can’t work concurrently with the SFU (or the SPs, for that matter), so when it works, it works alone (like a hitman!).
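To make that concrete, here is a minimal CUDA sketch of the pattern (my own illustration, not from the original post; the kernel name and parameters are invented). The inner loop carries one dependent multiply-add and one independent multiply, which leaves the compiler free to dual-issue the MUL to the SFU:

__global__ void madmul(float *out, const float *b, const float *c, const float *f, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a  = 0.0f;                       // accumulator for the MAD chain
    float e  = 1.0f;                       // accumulator for the "free" MUL chain
    float bv = b[tid], cv = c[tid], fv = f[tid];
    for (int i = 0; i < n; ++i)
    {
        a = a + bv * cv;                   // multiply-add, executed by the SPs
        e = e * fv;                        // independent multiply, a candidate for the SFU
    }
    out[tid] = a + e;                      // keep both results live so neither chain is optimized away
}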

Now that’s theory. In practice, you will rarely reach peak SP performance due to memory bandwidth limitation (unless you do hundreds of computations per each memory operation) while it’s much easier to get 95% of the theoretical peak for DP (Volkov shows DGEMM can reach 97% here, while only getting 60% in single precision, so the real difference is closer to 4x in this application). But it’s kinda like saying “DP is comparatively fast because SP is slower than advertised” - you know, it’s all relative ;)

Yes, this seems to be the most likely hypothesis, though NVIDIA does not explicitly say so, nor have I seen microbenchmarks to prove it.

Oh, thanks. But I still don’t clearly understand the hardware implementation details for a single instruction. For example, in a kernel, double a = b[tid] * c[tid]; (b and c are in global memory, a is in a register). How do the SM and the SPs organize the operation?

What you wrote compiles to more than a single instruction really.

First, the hardware will issue a read that will place the value of b[tid] in a register (implicitly). Then it will do the same for c[tid]. This is because we can’t do arithmetic or logic operations on anything but registers (and perhaps shared memory but I’d bet not). And finally, it will perform a mul on the two new registers. So it basically translates the above into something like this (conceptually):

double a;
double implicitB = load( b[tid] );
double implicitC = load( c[tid] );
a = implicitB * implicitC;

Now, to be fair, something like load( b[tid] ) actually compiles to 3 PTX instructions.

  1. load the address of array b into a register (let’s say r1)

  2. since b[tid] is really the address of b plus the offset given by tid, there’s an addition, and the result of that addition goes to r2.

  3. implicitB = load( r2 );

(and that’s assuming we already have tid in some register)

Here’s the PTX, to see how it looks through the looking glass:

ld.param.u32    %r3, [__cudaparm__Z5amulbPfS_S__b];
add.u32         %r4, %r3, %r2;
ld.global.f32   %f1, [%r4+0];
ld.param.u32    %r5, [__cudaparm__Z5amulbPfS_S__c];
add.u32         %r6, %r5, %r2;
ld.global.f32   %f2, [%r6+0];
mul.f32         %f3, %f1, %f2;

The first three instructions do the load in the way I just described, broken down into:

  1. load the address passed in function parameter into a register (here, into %r3)

  2. add to this the offset given by our tid (it sits in %r2), giving %r4, the address of the specific element of the b array that we want to read (the syntax for “x = y + z” is “add.u32 x, y, z”)

  3. read the value pointed to by %r4 from global memory into a floating-point register, %f1. The [%r4+0] notation means we don’t apply any fixed offset to the address.

Then we do the same for c[tid], and finally we do a mul.f32 on the two registers we just read from global memory, putting the result in %f3. %f3 is our ‘a’ variable (note that I actually used single-precision floats here).
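For reference, a source kernel along these lines would look like the sketch below. This is my reconstruction, not the original poster’s code: the parameter name __cudaparm__Z5amulbPfS_S__b demangles to amulb(float*, float*, float*), but the exact body and the way tid is computed are assumptions.

__global__ void amulb(float *a, float *b, float *c)
{
    int tid = threadIdx.x;       // assumed; the PTX above starts with tid already sitting in a register
    a[tid] = b[tid] * c[tid];    // two global loads, one mul.f32, one global store
}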

Now, this is how it would look in hardware as well, if it weren’t massively parallel. Notice a funny thing in the code above: there’s nothing telling you that there are multiple threads performing this.

It gets slightly more complicated by the fact that we execute threads in warps. PTX is not the actual assembly; it later gets compiled down to machine code, and during that process some wild things may happen. Perhaps decuda demystifies this, but I’ve never toyed with it. PTX is as low as I’m willing to go at this point :)

BTW: If I totally missed the point and gave an unnecessary lecture on assembly coding while you wanted the sub-assembly hardware stuff, then sorry, my bad :)

Big_Mac, your detailed analysis is only true if the variables in question are global device variables. In that case, the code will certainly be IO bound: the time needed for the loads from memory will vastly outweigh the time spent on the arithmetic, so the number of arithmetic operations per clock cycle, which is the point of this thread, hardly matters.

For problems that are not IO bound, which are the ones where arithmetic throughput matters, the operations must be predominantly operations on registers.
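As a rough sketch of what that means in code (illustrative only; the kernel name and the iteration count are made up), the idea is to load each operand from global memory once and then keep the arithmetic entirely on registers:

__global__ void register_bound(double *out, const double *b, const double *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double x = b[tid];                  // one global load
    double y = c[tid];                  // one global load
    double acc = 0.0;
    for (int i = 0; i < 256; ++i)       // many register-only multiply-adds per memory access
        acc = acc + x * y;
    out[tid] = acc;                     // one global store
}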

Thank you very much. Your assembly analysis is very helpful. Of course, I’m very interested in the hardware stuff, and it seems it will take me some time to understand how the multiprocessors and scalar processors actually work. :)