clock cycles of double operations

Hi,

You know that sm_1.3 can support double-precision operations in kernels. My question is how many clock cycles the double operations should take, such as add, subtract, multiply, divide, sqrt, reciprocal … (I cannot find it in the programming guide; it only has the information for single-precision float)

By the way, I have another question: when I use shared memory of double type, will there be bank conflicts? For example: __shared__ double sh[64]; sh[tid] = x; …

My GPU is a GTX 280. Thanks a lot!

Minming

I believe the 64-bit FMAD works as fast as the 32-bit one.

If I understand this correctly, the way we calculate peak FLOPS for single precision is to count a multiply-add (done by the SP’s FMAD) as two operations and add to that a single MUL (performed concurrently by the SFU), thus 3 operations per cycle. With 8 SPs, that’s 24 operations per cycle per SM. Double-precision peak FLOPS is one twelfth of the single-precision figure (78G vs 933G), so that’s 2 operations per cycle per SM. Since there’s only one 64-bit unit per SM, that works out to 2 operations per cycle for this unit, which must be a single multiply-add.

If my FLOPS calculations aren’t off, 64-bit add, sub and mul should take 1 cycle per operation. Multiply-add also takes one cycle (only it’s counted as two operations by benchmark-junkies).
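For concreteness, here is the arithmetic behind those peak figures as a tiny C sketch. It takes the GTX 280’s commonly quoted 30 SMs and ~1.296 GHz shader clock as assumptions (they are not stated above), and they reproduce the 933G / 78G numbers:

#include <stdio.h>

int main(void)
{
    /* Assumed GTX 280 figures: 30 SMs, ~1.296 GHz shader clock. */
    const int    sm_count   = 30;
    const double shader_ghz = 1.296;

    /* Per SM, per cycle: 8 SPs each doing a MAD (2 flops) plus 8 SFU-issued MULs (1 flop each). */
    double sp_peak = sm_count * (8 * 2 + 8 * 1) * shader_ghz;   /* ~933 GFLOPS */

    /* Per SM, per cycle: 1 double-precision unit doing an FMAD (2 flops). */
    double dp_peak = sm_count * (1 * 2) * shader_ghz;           /* ~78 GFLOPS  */

    printf("SP peak ~ %.0f GFLOPS, DP peak ~ %.0f GFLOPS\n", sp_peak, dp_peak);
    return 0;
}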

No idea about div, sqrt, inv etc.

What you said is right, with the only correction that it takes 1 cycle per thread. Normally one evaluates the number of clock cycles needed for the entire warp, hence it is 32 clock cycles for double-precision ops (as opposed to 4 clock cycles for single-precision arithmetic, because there are 8 SPs per SM).
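Spelling that out (taking the standard 32-thread warp size as given):

    32 threads per warp / 8 SPs per SM     = 4 cycles per warp per single-precision instruction
    32 threads per warp / 1 DP unit per SM = 32 cycles per warp per double-precision instruction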

So 32 vs 4 cycles. Does that mean double operations are 8 times slower than float operations?

The double-precision unit is as fast as the single-precision unit in that it can do one operation per cycle per thread (two, if you count a MAD as 2 ops). But there are 8 single-precision units and only 1 double-precision unit per SM, so parallel single-precision calculation should be 8 times faster.

Also, single-precision computation can theoretically be made concurrent with the use of the SFU, giving another MUL per cycle per thread. AFAIK, this mostly happens when your code looks somewhat like this (and if the compiler is in a good mood):

for (;;)  // for ever
{
    a = a + b*c;  // FMAD, this saturates the SPs
    e = e*f;      // MUL, this goes to the SFU, a "free MUL"
}

Such code is surprisingly common in graphics, less so in more general programming. But on such rare occasions, you theoretically get 12 times more oomph out of the 8 SPs and the SFU than from a single, lonely DP unit. The DP unit can’t work concurrently with the SFU (or the SPs, for that matter), so when it works, it works alone (like a hitman!).
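To make that concrete, here is a minimal CUDA sketch of the pattern (my own illustration, not from the original post; the kernel name and parameters are invented). The inner loop carries one dependent multiply-add and one independent multiply, which leaves the compiler free to dual-issue the MUL to the SFU:

__global__ void madmul(float *out, const float *b, const float *c, const float *f, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a  = 0.0f;                       // accumulator for the MAD chain
    float e  = 1.0f;                       // accumulator for the "free" MUL chain
    float bv = b[tid], cv = c[tid], fv = f[tid];
    for (int i = 0; i < n; ++i)
    {
        a = a + bv * cv;                   // multiply-add, executed by the SPs
        e = e * fv;                        // independent multiply, a candidate for the SFU
    }
    out[tid] = a + e;                      // keep both results live so neither chain is optimized away
}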

Now that’s theory. In practice, you will rarely reach peak SP performance due to memory bandwidth limitation (unless you do hundreds of computations per each memory operation) while it’s much easier to get 95% of the theoretical peak for DP (Volkov shows DGEMM can reach 97% here, while only getting 60% in single precision, so the real difference is closer to 4x in this application). But it’s kinda like saying “DP is comparatively fast because SP is slower than advertised” - you know, it’s all relative ;)

Yes, this seems to be the most likely hypothesis, though NVIDIA does not explicitly say so, nor have I seen microbenchmarks to prove it.

Oh, thanks. But I still don’t clearly understand the hardware implementation details for a single instruction. For example, in a kernel, double a = b[tid] * c[tid]; (b and c are in global memory, a is in a register). How do the SM and the SPs organize the operation?

What you wrote compiles to more than a single instruction really.

First, the hardware will issue a read that will place the value of b[tid] in a register (implicitly). Then it will do the same for c[tid]. This is because we can’t do arithmetic or logic operations on anything but registers (and perhaps shared memory but I’d bet not). And finally, it will perform a mul on the two new registers. So it basically translates the above into something like this (conceptually):

double a;
double implicitB = load( b[tid] );
double implicitC = load( c[tid] );
a = implicitB * implicitC;

Now, to be fair, something like load( b[tid] ) actually compiles to 3 PTX instructions.

  1. load the address of array b into a register (let’s say r1)

  2. since b[tid] is really the address of b plus the offset given by tid, there’s an addition, and the result of that addition goes to r2.

  3. implicitB = load( r2 );

(and that’s assuming we already have tid in some register)

Here’s the PTX, to see how it looks through the looking glass:

ld.param.u32    %r3, [__cudaparm__Z5amulbPfS_S__b];
add.u32         %r4, %r3, %r2;
ld.global.f32   %f1, [%r4+0];
ld.param.u32    %r5, [__cudaparm__Z5amulbPfS_S__c];
add.u32         %r6, %r5, %r2;
ld.global.f32   %f2, [%r6+0];
mul.f32         %f3, %f1, %f2;

The first three instructions do the load in the way I just described, broken down into:

  1. load the address passed in function parameter into a register (here, into %r3)

  2. add to this the offset given by our tid (it sits in %r2), giving %r4, the address of the specific element of the b array that we want to read (the syntax for “x = y + z” is “add.u32 x, y, z”)

  3. read the value pointed to by %r4 from global memory into a floating-point register, %f1. The [%r4+0] notation means we don’t apply any fixed offset to the address.

Then we do the same for c[tid], and finally we do a mul.f32 on the two registers we just read from global memory, putting the result in %f3. %f3 is our ‘a’ variable (note that I actually used single-precision floats here).
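For reference, a source kernel along these lines would look like the sketch below. This is my reconstruction, not the original poster’s code: the parameter name __cudaparm__Z5amulbPfS_S__b demangles to amulb(float*, float*, float*), but the exact body and the way tid is computed are assumptions.

__global__ void amulb(float *a, float *b, float *c)
{
    int tid = threadIdx.x;       // assumed; the PTX above starts with tid already sitting in a register
    a[tid] = b[tid] * c[tid];    // two global loads, one mul.f32, one global store
}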

Now, this is how it would look in hardware as well, if it weren’t massively parallel. Notice a funny thing in the code above: there’s nothing telling you that there are multiple threads performing this.

It gets slightly more complicated by the fact that we execute threads in warps. PTX is not the actual assembly; it later gets compiled down to machine code, and during that process some wild things may happen. Perhaps decuda demystifies this, but I’ve never toyed with it. PTX is as low as I’m willing to go at this point :)

BTW: If I totally missed the point and gave an unnecessary lecture on assembly coding while you wanted the sub-assembly hardware stuff, then sorry, my bad :)

Big_Mac, your detailed analysis is only true if the variables in question are global device variables. In that case, the code will certainly be IO bound: the time needed for the loads from memory will vastly outweigh the time spent on the arithmetic, so the number of arithmetic operations per clock cycle, which is the point of this thread, hardly matters.

For problems that are not IO bound, which are the ones where arithmetic throughput matters, the operations must be predominantly operations on registers.
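As a rough sketch of what that means in code (illustrative only; the kernel name and the iteration count are made up), the idea is to load each operand from global memory once and then keep the arithmetic entirely on registers:

__global__ void register_bound(double *out, const double *b, const double *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double x = b[tid];                  // one global load
    double y = c[tid];                  // one global load
    double acc = 0.0;
    for (int i = 0; i < 256; ++i)       // many register-only multiply-adds per memory access
        acc = acc + x * y;
    out[tid] = acc;                     // one global store
}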

Thank you very much. Your assembly analysis is very helpful. Of course, I’m very interested in the hardware stuff, and it seems it will take me some time to understand how the multiprocessors and scalar processors actually work. :)