Instruction Latency

chris22 · May 2, 2008, 9:16pm

Does anyone know the latency for bitwise instructions or logical comparisons?

PINS · May 12, 2008, 12:23am

The programming guide knows:

“4 clock cycles for floating-point add, floating-point multiply, floating-point multiply-add, integer add, bitwise operations, compare, min, max, type conversion instruction”

:)

chris22 · May 12, 2008, 4:04pm

That refers to throughput for a warp, not latency.

paulius · May 12, 2008, 6:50pm

As a related guideline, the programming guide does state that 6 warps (or 192 threads) per multiprocessor are needed to hide register dependencies.

Paulius

chris22 · May 18, 2008, 2:43am

Well, that implies over twenty cycles of latency. Does that merely mean that the pipeline has twenty plus stages? Or does that it mean two sequential machine instructions in a single thread must have must have twenty cycles between them to avoid stalls? Many CPUs have deep pipelines but utilize a variety of hardware techniques to eliminate data hazards.

Mark_Harris · May 22, 2008, 10:26am

Yes, there are over 20 cycles of latency. But unlike a serial processor that needs to use lots of hardware techniques to hide this latency, the GPU can just run enough threads to hide this latency. (We trade all that hardware complexity for more computation units.)

So, as Paulius says, as long as you have at least 6 warps of threads resident, you will hide the ALU pipeline latency. Note that these 6 warps need not be from the same thread block. So even if you have 64-thread blocks, as long as each uses at most 25% of the register (up to 42 registers per thread) and at most 25% of shared memory (4096 bytes per thread block), you will be able to hide this latency.

In other words: 25% occupancy to hide ALU pipeline latency.

Mark

Axure · March 17, 2009, 11:40am

This is a very interesting thread. I’ve found it through the AnandTech article that tries to draw some conclusions from your discussions. However, I am not fully convinced that things have been named explicitly here so I’ll ask for a more specific answer to chris22’s lat post.

Are we indeed dealing with a pipeline 20-sth deep? Or are there other factors that force a latency of over 20 cycles between consecutive instructions in a single thread?

If it’s the former, does it mean that we have in fact four small, 5- or 6-stage deep pipelines concatenated into one long SP pipeline?

Sylvain_Collange · March 21, 2009, 10:44am

20 cycles is the latency of the operand gather, execution and writeback stages only (between two consecutive instructions that have a read-after-write dependency). The fetch, decode and schedule stages take around 8 more cycles, IIRC.

Where the 20-cycle figure comes from :

operand gather:

Input operands from registers are read sequentially from the same bank(s) of the register file.

Takes from 2 cycles (1 32-bit input operand) to 12 cycles (3 64-bit input operands).

(actually 1 to 6 cycles as the RF clock is half the shader clock)
execution:

The 32 operations of the warp are dispatched to the available units.

Takes from 4 cycles (MAD, 8 units) to 32 cycles (double-precision, 1 unit) just to start the pipeline.

Then the execution pipe latency is around 10 cycles.
readback:

Same as gather, takes 2 to 4 cycles.

Summing it up, you get ~18 cycles for a MOV, ~20 cycles for an ADD and ~22 cycles for a MAD.

GT200 added two more cycles to most (all?) instructions compared to G9x.

Of course, everything above is mostly speculation from microbenchmark results, patent reading and educated guesses…

Axure · March 21, 2009, 11:39am

Wow, that’s an impressive analysis.

Thanks a lot for taking the time to explain! :)

xjtusnail · March 27, 2009, 5:39am

20 cycles is the latency of the operand gather, execution and writeback stages only (between two consecutive instructions that have a read-after-write dependency). The fetch, decode and schedule stages take around 8 more cycles, IIRC.

Where the 20-cycle figure comes from :

operand gather:

Input operands from registers are read sequentially from the same bank(s) of the register file.

Takes from 2 cycles (1 32-bit input operand) to 12 cycles (3 64-bit input operands).

(actually 1 to 6 cycles as the RF clock is half the shader clock)

execution:

The 32 operations of the warp are dispatched to the available units.

Takes from 4 cycles (MAD, 8 units) to 32 cycles (double-precision, 1 unit) just to start the pipeline.

Then the execution pipe latency is around 10 cycles.

readback:

Same as gather, takes 2 to 4 cycles.

Summing it up, you get ~18 cycles for a MOV, ~20 cycles for an ADD and ~22 cycles for a MAD.

GT200 added two more cycles to most (all?) instructions compared to G9x.

Of course, everything above is mostly speculation from microbenchmark results, patent reading and educated guesses…

Hi, Sylvain,

I execute dependent operation a += b*c to measure pipeline latency.

my GPU is 9600GT, launch only one thread.

let a, b, c all reside in register, latency is 24 cycles;

let a, c reside in register, b in constant memory, latency is 22 cycles;

let a, c reside in register, b in share memory, latency is 26 cycles.

Does the difference of operand selection result in above results?

Can anyone give me a detailed explanation or other related measure data?

Do you have some comment on my test?

Sylvain_Collange · March 27, 2009, 8:45pm

Your results are consistent with those obtained by vvolkov in his paper:

http://bebop.cs.berkeley.edu/pubs/volkov20…enchmarking.pdf

I consistently get 2 cycles less in my own experiments, but this is probably due to a difference in measurement setup (I assemble my own assembly code with cudasm for benchmarks on instructions).

Registers are read sequentially for one instruction, this latency can be overlapped with reads from constant memory and shared memory latency is higher.

For the gory details about registers, refer to this patent.

xjtusnail · March 28, 2009, 6:43am

Thank you very much.

BTW, below is my test data.

Operation Latency a b c

a+=b*c 22.01 reg constant memory reg

a+=b*c 24.003 reg reg reg

a+=b*c 26.006 reg share memory reg

a+=b 20.01 reg constant memory

a+=b 20.006 reg reg

a+=b 24.007 reg share memory

a*=b 20.01 reg constant memory

a*=b 20.009 reg reg

a*=b 24.007 reg share memory

Do you have some benchmark or data can validate that Registers are read sequentially for one instruction?

if it is true, according to my data, is it mean the latency of getting operand from constant cache to ALU is less than 4 cycles, and getting from share memory is about 8 cycles?

Sylvain_Collange · March 29, 2009, 11:03am

Well, you already answered your own question. ;)

Also try:

a=b*a+a

a=a*a+a

a=a+a

a=a*a

…

For the constant cache it would make sense, since the 32-wide load is split into two halfs, and tho constant cache runs at half the shader clock (according to the documentation).

Shared memory requires arbitration logic and a full crossbar between execution units and SRAM banks, so it will not be surprising if it require one extra (slow) clock cycle.

xjtusnail · March 29, 2009, 2:10pm

thank you, Sylvain.

Jamie_K · March 29, 2009, 5:07pm

Does anyone know the throughput and latency of integer divide and modulus operations?

Or have source code for these types of tests?

T.B · March 30, 2009, 12:39pm

The problem with these is that they are not instructions but done in software. As the code is not branch-free, the performance will depend on what you put into them. Joy.

jaka1 · January 18, 2010, 10:53pm

By the way, it is not clear to me how the operation a += bc; takes only ~ 24 clock cycles.
bc would require 20+ clock cycles to be finished. Then it needs
another 20+ clock cycles for the addition which has to wait till the multiplication is finished.

Thanks

jack · January 18, 2010, 11:40pm

I don’t know the low-level stuff as well as Sylvain Collange or Gregory Diamos, but if I had to guess, it would be that a += b*c is actually using an FMA (fused-multiply-add) instruction (a = a + b * c)…which probably doesn’t have a whole lot more latency than a single multiple or add instruction.

jaka1 · January 18, 2010, 11:46pm

Thanks profquail, this makes sense.

Topic		Replies	Views
How to understand the "hide latency" CUDA Programming and Performance	13	3108	August 8, 2024
Pipeline Latencies on GPU vs CPU typical CPU pipeline latencies? CUDA Programming and Performance	17	11514	December 7, 2009
Does %clock measure actual GPU cycles, or what? CUDA Programming and Performance	5	1572	July 9, 2019
Questin regarding latency CUDA Programming and Performance	6	4243	August 26, 2010
Parallel Access to GDU Global Memory CUDA Programming and Performance	9	8931	January 24, 2008
What limits the IPC in CUDA? or How to decrease the avg execution dependency cycles? CUDA Programming and Performance	6	7164	March 30, 2013
Latency and low-level performance questions CUDA Programming and Performance	10	4280	June 23, 2015
a deep dive into Instruction-level parallelism CUDA Programming and Performance	17	4867	December 18, 2018
latency and throughput of MAD operation? CUDA Programming and Performance	0	3321	December 10, 2009
Basic question about warps CUDA Programming and Performance	14	6571	June 9, 2009

Instruction Latency

Related topics