Basic question about hiding latency

“With enough warps around, the hardware will likely find a warp to execute at any point in time, thus making full use of the execution hardware in spite of these long-latency operations. With warp scheduling, the long waiting time of warp instructions is hidden by executing instructions from other warps.”

This is quoted from "Programming Massively Parallel Processors" (David Kirk). I have a question about whether I understand it correctly.

Does that mean that when a half-warp hits something like a global memory access, it has to wait, and another warp begins or continues in the meantime?
My question is: if all threads start with long-latency operations, can the latency still be hidden?

If not, I am confused by another trick I learned here:
http://on-demand.gputechconf.com/gtc/2014/presentations/S4170-put-neutron-transport-sims-nuclear-reactions.pdf
In that GTC slide (page 24), the trick “load outside inner loop” moves the global memory accesses in 5 steps up to the first several lines, saves the values in registers, and then accesses those registers afterwards.
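A minimal sketch of what I understand the trick to be (the kernel names, arguments, and 5-step structure are my own illustration, not taken from the slides):

```cuda
// Hypothetical illustration of the "load outside inner loop" trick.
__global__ void transport_naive(const float *xs, const float *e,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int step = 0; step < 5; ++step)
        acc += xs[step] * e[i];   // global load inside the loop
    out[i] = acc;
}

__global__ void transport_hoisted(const float *xs, const float *e,
                                  float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Issue all global loads up front. The warp does not stall on the
    // loads themselves; it stalls only when a loaded value is first used.
    float x0 = xs[0], x1 = xs[1], x2 = xs[2], x3 = xs[3], x4 = xs[4];
    float ei = e[i];
    // By the time these uses execute, much of the load latency
    // may already have elapsed.
    out[i] = (x0 + x1 + x2 + x3 + x4) * ei;
}
```

So the trick doesn't remove the latency; it separates the loads from their first use so that the stall, if any, is shorter.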

How can this be explained?

Why ask about half-warps? Do you want an answer specific to cc1.x devices?

When a warp hits something that has latency, it has to wait. More precisely, a load by itself does not cause a stall; attempting to use the loaded value is what stalls, until the value has actually been loaded (e.g. from global memory).

When a warp stalls, the scheduler will place it in a “waiting” queue until the stall condition has been resolved. The scheduler will then attempt to schedule other warps from the “ready” queue, if any exist.

If an early kernel operation encounters a stall, then the latency for that operation perhaps cannot be hidden by other warps in that threadblock. However, other warps in other threadblocks may be at a different stage of execution, and therefore may be able to be scheduled.

The latency associated with an early kernel operation in the earliest threadblocks to get scheduled probably cannot be hidden. But latency hiding is, for the most part, something we talk about statistically rather than in absolute terms.
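One way to make the statistical view concrete is Little's law, which relates the concurrency needed to the latency being covered. The numbers below are purely illustrative assumptions (a ~400-cycle global memory latency, one warp-instruction issued per cycle per scheduler), not specification values:

```latex
% Little's law applied to latency hiding (illustrative numbers only):
N_{\text{warp-instructions in flight}} \approx \text{latency} \times \text{issue rate}
\quad\Rightarrow\quad
N \approx 400~\text{cycles} \times 1~\tfrac{\text{warp-instruction}}{\text{cycle}} = 400
```

In other words, with enough independent warp-instructions in flight, the average throughput approaches the peak even though any individual warp spends most of its time waiting.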

To answer your question first:
To be honest, what I have read led me to assume everything is based on half-warps. Are you implying that newer devices schedule whole warps?

I consider devices that are cc 2.x or newer to be principally focused on scheduling warps, not half-warps. Please educate me if I am wrong. It’s true that cc 2.0 (Fermi) devices do execute under the hood as half-warps, but this is due to an implementation specific of Fermi (the “hotclock”), which runs the GPU cores at twice the nominal execution frequency and uses the cores to execute each half of a warp (instruction) sequentially. But this is an implementation detail, in my opinion. Post-Fermi devices both schedule and execute at the warp level, AFAIK, having eliminated the “hotclock” arrangement. Most architectures are pipelined, and various instructions are constrained by the availability of the necessary SFUs, so things are maybe not as simple as I have described, but I don’t understand why we need to talk about half-warps, except if you care about cc1.x execution, where the term half-warp is specifically mentioned and defined. Again, please educate me if I am wrong.

Half-warp is mostly used during the description of cc1.x devices in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-1-x

You might also search the programming guide for every instance of “half-warp” and draw your own conclusions as to its significance.

You have answered the questions of an engineering student new to CUDA very well, and made thread stalls, scheduling, and latency hiding clear to me.

Your latest reply reminds me of another question. What do you mean by ‘most architectures are pipelined’? I think I know the difference between SIMD (parallel) and SIMD (segmented, which is pipelined?). The GPU is described as SIMT, but execution at warp granularity seems like SIMD. Is that SIMD (parallel) or SIMD (segmented)?

I don’t know the difference between SIMD (parallel) and SIMD (segmented). When I said “most architectures are pipelined” I should have said “most CUDA GPU architectures are pipelined”, and I mean nothing more than that many instructions require more than a single cycle to complete. The documented throughputs are still valid, though. When I talk about throughput, I usually mean the number of operations that can be “retired” in a given clock cycle; it does not (necessarily) mean that those operations took only one clock cycle to execute in their entirety.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

To pose a specific example, suppose we have a Kepler cc 3.0 GPU (SMX). Let’s say a single-precision FMA instruction is to be scheduled and executed. The warp scheduler will select that warp/instruction and allocate it to 32 “cores” or single-precision units: physical hardware entities, something like ALUs, that perform floating-point operations.

Now if you read the above table, you might come to the mistaken conclusion that the 32 SP floating point results will be computed and available on the next cycle. That would be incorrect. The results will not be available for some number of cycles (in the ballpark of 20, I think, but don’t quote me on that). This is called arithmetic latency. However, due to the pipelined nature of processing, a new FMA instruction may be scheduled on the very same 32 SP-units, by a warp scheduler in the SMX, on the very next instruction cycle. So ~20 cycles later, the first set of 32 SP FMA results become available, and on the instruction cycle following that one, another set of 32 SP FMA results become available. This is possible because each SP-unit is pipelined.
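A sketch of how this interacts with instruction-level parallelism within a single warp (my own illustrative example; the ~20-cycle figure is just the ballpark mentioned above, not a documented value):

```cuda
// Illustrative only: four independent FMA chains per thread give the
// scheduler independent instructions to feed back-to-back into the
// pipelined SP units, instead of waiting ~20 cycles between
// dependent operations.
__global__ void fma_ilp(float *out, float a, float b, int iters)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int k = 0; k < iters; ++k) {
        // Each line depends only on its own accumulator, so the four
        // FMAs can occupy successive pipeline slots.
        s0 = fmaf(a, b, s0);
        s1 = fmaf(a, b, s1);
        s2 = fmaf(a, b, s2);
        s3 = fmaf(a, b, s3);
    }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = s0 + s1 + s2 + s3;
}
```

The same effect is achieved across warps: even if one warp has only dependent FMAs, the scheduler can issue FMAs from other warps into the pipeline on the intervening cycles.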

That is my understanding.

That makes sense now. The pipelining you mean is at a lower, per-instruction level, within the execution units themselves. That is not the macroscopic parallelism I had in mind.

https://www.dropbox.com/s/njztab2toozzb8k/2014-07-09%2011.52.26.jpg