Basic question about hiding latency

“With enough warps around, the hardware will likely find a warp to execute at any point in time, thus making full use of the execution hardware in spite of these long-latency operations. With warp scheduling, the long waiting time of warp instructions is hidden by executing instructions from other warps.”

This is quoted from "Programming Massively Parallel Processors" (David Kirk). I have a question about whether I understand it correctly.

Does that mean that when a half-warp hits something like a global memory access, it has to wait, and another warp begins or continues in the meantime?
My question is: if all threads start with long-latency operations, can the latency still be hidden?

If not, I am confused by another trick I learned here:
http://on-demand.gputechconf.com/gtc/2014/presentations/S4170-put-neutron-transport-sims-nuclear-reactions.pdf
In that GTC slide (page 24), the trick “load outside inner loop” moves the global memory accesses in 5 steps up to the first several lines, saves the values in registers, and then accesses those registers afterwards.
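A minimal sketch of what I understand the trick to be (the kernel names, arguments, and 5-step structure are my own illustration, not taken from the slides):

```cuda
// Hypothetical illustration of the "load outside inner loop" trick.
__global__ void transport_naive(const float *xs, const float *e,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int step = 0; step < 5; ++step)
        acc += xs[step] * e[i];   // global load inside the loop
    out[i] = acc;
}

__global__ void transport_hoisted(const float *xs, const float *e,
                                  float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Issue all global loads up front. The warp does not stall on the
    // loads themselves; it stalls only when a loaded value is first used.
    float x0 = xs[0], x1 = xs[1], x2 = xs[2], x3 = xs[3], x4 = xs[4];
    float ei = e[i];
    // By the time these uses execute, much of the load latency
    // may already have elapsed.
    out[i] = (x0 + x1 + x2 + x3 + x4) * ei;
}
```

So the trick doesn't remove the latency; it separates the loads from their first use so that the stall, if any, is shorter.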

How can this be explained?

Why ask about half-warps? Do you want an answer specific to cc1.x devices?

When a warp hits something that has latency, it has to wait. More precisely, a load by itself does not cause a stall; attempting to use the loaded value is what stalls, until the value has actually been loaded (e.g. from global memory).

When a warp stalls, the scheduler will place it in a “waiting” queue until the stall condition has been resolved. The scheduler will then attempt to schedule other warps from the “ready” queue, if any exist.

If an early kernel operation encounters a stall, then the latency for that operation perhaps cannot be hidden by other warps in that threadblock. However, other warps in other threadblocks may be at a different stage of execution, and therefore may be able to be scheduled.

The latency associated with an early kernel operation in the earliest threadblocks to get scheduled probably cannot be hidden. But latency hiding is, for the most part, something we talk about statistically rather than in absolute terms.
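One way to make the statistical view concrete is Little's law, which relates the concurrency needed to the latency being covered. The numbers below are purely illustrative assumptions (a ~400-cycle global memory latency, one warp-instruction issued per cycle per scheduler), not specification values:

```latex
% Little's law applied to latency hiding (illustrative numbers only):
N_{\text{warp-instructions in flight}} \approx \text{latency} \times \text{issue rate}
\quad\Rightarrow\quad
N \approx 400~\text{cycles} \times 1~\tfrac{\text{warp-instruction}}{\text{cycle}} = 400
```

In other words, with enough independent warp-instructions in flight, the average throughput approaches the peak even though any individual warp spends most of its time waiting.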

To answer your question first:
To be honest, what I have read led me to assume everything is based on half-warps. Are you implying that newer devices schedule whole warps?

I consider devices that are cc 2.x or newer to be principally focused on scheduling warps, not half-warps. Please educate me if I am wrong. It’s true that cc 2.0 (Fermi) devices do execute under the hood as half-warps, but this is due to an implementation specific of Fermi (the “hotclock”), which runs the GPU cores at twice the nominal execution frequency and uses the cores to execute each half of a warp (instruction) sequentially. But this is an implementation detail, in my opinion. Post-Fermi devices both schedule and execute at the warp level, AFAIK, having eliminated the “hotclock” arrangement. Most architectures are pipelined, and various instructions are constrained by the availability of the necessary SFUs, so things are maybe not as simple as I have described, but I don’t understand why we need to talk about half-warps, except if you care about cc1.x execution, where the term half-warp is specifically mentioned and defined. Again, please educate me if I am wrong.

Half-warp is mostly used during the description of cc1.x devices in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-1-x

You might also search the programming guide for every instance of “half-warp” and draw your own conclusions as to its significance.

You have answered the questions of an engineering student new to CUDA very well, and made thread stalls, scheduling, and latency hiding clear to me.

Your latest reply reminds me of another question. What do you mean by ‘most architectures are pipelined’? I think I know the difference between SIMD (parallel) and SIMD (segmented, which is pipelined?). The GPU is described as SIMT, but execution at warp granularity seems like SIMD. Is that SIMD (parallel) or SIMD (segmented)?

I don’t know the difference between SIMD (parallel) and SIMD (segmented). When I said “most architectures are pipelined” I should have said “most CUDA GPU architectures are pipelined”, and I mean nothing more than that many instructions require more than a single cycle to complete. The documented throughputs are still valid, though. When I talk about throughput, I usually mean the number of operations that can be “retired” in a given clock cycle; it does not (necessarily) mean that those operations took only one clock cycle to execute in their entirety.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

To pose a specific example, suppose we have a Kepler cc 3.0 GPU (SMX). Let’s say a single-precision FMA instruction is to be scheduled and executed. The warp scheduler will select that warp/instruction and allocate it to 32 “cores” or single-precision units: physical hardware entities, something like ALUs, that perform floating-point operations.

Now if you read the above table, you might come to the mistaken conclusion that the 32 SP floating point results will be computed and available on the next cycle. That would be incorrect. The results will not be available for some number of cycles (in the ballpark of 20, I think, but don’t quote me on that). This is called arithmetic latency. However, due to the pipelined nature of processing, a new FMA instruction may be scheduled on the very same 32 SP-units, by a warp scheduler in the SMX, on the very next instruction cycle. So ~20 cycles later, the first set of 32 SP FMA results become available, and on the instruction cycle following that one, another set of 32 SP FMA results become available. This is possible because each SP-unit is pipelined.
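A sketch of how this interacts with instruction-level parallelism within a single warp (my own illustrative example; the ~20-cycle figure is just the ballpark mentioned above, not a documented value):

```cuda
// Illustrative only: four independent FMA chains per thread give the
// scheduler independent instructions to feed back-to-back into the
// pipelined SP units, instead of waiting ~20 cycles between
// dependent operations.
__global__ void fma_ilp(float *out, float a, float b, int iters)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int k = 0; k < iters; ++k) {
        // Each line depends only on its own accumulator, so the four
        // FMAs can occupy successive pipeline slots.
        s0 = fmaf(a, b, s0);
        s1 = fmaf(a, b, s1);
        s2 = fmaf(a, b, s2);
        s3 = fmaf(a, b, s3);
    }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = s0 + s1 + s2 + s3;
}
```

The same effect is achieved across warps: even if one warp has only dependent FMAs, the scheduler can issue FMAs from other warps into the pipeline on the intervening cycles.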

That is my understanding.

That makes sense now. The pipelining you mean is at a lower, per-instruction level, within the execution units themselves. That is not the macroscopic parallelism I had in mind.

https://www.dropbox.com/s/njztab2toozzb8k/2014-07-09%2011.52.26.jpg