Question regarding latency

Hi!
I’m pretty new to CUDA programming and I’m just going through the NVIDIA manuals.
I’ve got a question regarding the latency (the number of clock cycles it takes for a warp to be ready to execute its next instruction).
They say:
“In order to hide a latency of L clock cycles with basic single-precision floating-point arithmetic instructions L/4 (rounded up to nearest integer) instructions are required for devices of compute capability 1.x since a multiprocessor issues one such instruction per warp over 4 clock cycles.”

What does it mean that a multiprocessor issues an instruction? Does it mean that the instruction is executed?
That’s what I first thought, but on the next page I read the following:
“If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In the case of a back-to-back register dependency (i.e., some input operand is written by the previous instruction), the latency is equal to the execution time of the previous instruction and the warp scheduler must schedule instructions for different warps during that time. Execution time varies depending on the instruction, but it is typically about 22 clock cycles, which translates to 6 warps for devices of compute capability 1.x.”

Here it states that the execution time is generally 22 clock cycles…

Can someone help me understand the exact terminology (I’m not a native English speaker)?

Thanks very much!

Here we had a little chit-chat about issuing instructions, latency and all that stuff: how many threads to hide latency

If I don’t understand the basic terminology, I can’t understand that topic either…

So the main question remains: What does it mean that a multiprocessor issues an instruction? Does it mean that the instruction is executed?

Thanks again!

The two basic terms are latency and throughput. The latency is the time it takes from starting execution until the results are available (so basically all the time it takes to execute the instruction). This is 22 to 24 cycles for most instructions in devices of compute capability 1.x.
Throughput, on the other hand, is the number of instructions that finish execution in a given amount of time. For many instructions it is one instruction every four cycles. This is faster than the latency number suggests because the execution of multiple instructions is overlapped (“pipelined”).
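
To make the two terms concrete, here is a small kernel sketch of my own (it is not from the Programming Guide; the name and constants are made up for illustration). The first group of multiply-adds is a dependent chain, so each one has to wait for the latency of the previous one; the second group is independent, so a warp can issue them at the throughput rate.

__global__ void latencyVsThroughput(float *out, float a, float b)
{
    // Dependent chain: every multiply-add reads the result of the previous
    // one, so each has to wait the full ~22-24 cycle latency of its predecessor.
    float x = a;
    x = x * b + 1.0f;
    x = x * b + 1.0f;
    x = x * b + 1.0f;

    // Independent operations: none of them reads another one's result, so
    // the warp can issue one every 4 cycles (the throughput number) without
    // waiting for earlier results.
    float p = a * b + 1.0f;
    float q = a * b + 2.0f;
    float r = a * b + 3.0f;

    out[threadIdx.x] = x + p + q + r;
}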

Issuing an instruction means that it is sent to the functional units that execute it, so issuing basically starts its execution (although by that time it has already been decoded, so the total time for processing an instruction is larger).
Finishing execution of an instruction, so that its results are available, is called “retiring”. Some processors (but no CUDA devices so far) speculatively start to execute conditional instructions before the result of the condition is available, and only retire the instruction if the condition evaluates to true. Thus the number of instructions retiring per unit of time (the throughput) may be smaller than the number of instructions starting execution.

Here is some more info for compute 1.x devices I wrote in a different thread:

Each core can start execution of one instruction of a thread per clock cycle. As a multiprocessor has only 8 cores, this takes 4 cycles for the 32 threads of a warp. So after 4 cycles, the warp can start execution of another instruction. This is the 4-cycle throughput number given in the Programming Guide.

However, the result of the operation is only known after execution of the instruction finishes, after about 22…24 (fast) clock cycles. So if the next operation wants to use this result (“depends on it”), it has to wait for 24 (instead of 4) clock cycles. Now if you have 6 warps running in round-robin fashion, each warp’s instruction results will be available just in time for that warp’s next instruction to run. This is called latency hiding, as it now appears that latency is irrelevant and only throughput determines the execution speed.
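
The arithmetic behind the 6 warps: with a latency of about 22…24 cycles and one instruction issued per warp every 4 cycles, you need roughly 24/4 = 6 warps taking turns. As a back-of-the-envelope sketch (this is my own helper, not a CUDA API function):

/* Estimate how many resident warps are needed so that, with round-robin
   issuing, each warp's previous result is ready by the time its next
   instruction comes up. */
unsigned int warpsToHideLatency(unsigned int latencyCycles,
                                unsigned int issueCyclesPerWarp)
{
    /* Round up: a 22..24 cycle latency at 4 cycles per issue gives 6 warps. */
    return (latencyCycles + issueCyclesPerWarp - 1) / issueCyclesPerWarp;
}

Both warpsToHideLatency(22, 4) and warpsToHideLatency(24, 4) give 6, which matches the 6 warps mentioned in the Guide.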

For this it does not matter whether the 6 warps are from the same thread block or from different ones. However, warps from the same block tend to stall (waiting for memory or at a __syncthreads()) at the same time (as they execute the same instruction stream and synchronize at each __syncthreads()), so there is a slight advantage if some of the warps are from a different block.
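
As an illustration only (the kernel, sizes and launch configurations below are made up), both launches put the same total number of warps on the device, but the second spreads them over smaller blocks, so the warps resident on one multiprocessor come from several blocks and are less likely to stall at exactly the same point:

#include <cuda_runtime.h>

__global__ void addOne(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    const int n = 192 * 64;              /* arbitrary problem size */
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    addOne<<<n / 192, 192>>>(d);         /* 6 warps per block */
    addOne<<<n / 64, 64>>>(d);           /* 2 warps per block, three times as many blocks */

    cudaThreadSynchronize();             /* cudaDeviceSynchronize() on newer toolkits */
    cudaFree(d);
    return 0;
}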

There is another form of “latency hiding” in the Programming Guide. Since the latency numbers are mostly irrelevant once you have at least 6 warps per SM, the Programming Guide just gives the throughput numbers and the rule to have at least 6 warps per SM. Latency is mostly hidden in the Guide as well. :)

Hope this helps.
