Is warp a hardware thread?

From the software’s perspective, a warp consists of 32 software threads. However, in terms of how a warp is actually implemented in the hardware – except for Volta – it seems like a single warp is closer to a single thread in the traditional sense. A warp has one program counter and one program stack, and, most importantly, the warp scheduler issues instructions at warp granularity. The only thing that makes each software thread seem like a thread is that each has its own set of registers. So a CUDA thread is just an abstraction for 1/32 of a warp, and a warp seems to be closer to what we usually think of as a thread. Am I correct? Please let me know if I am wrong.

A warp is a bundle of threads, hence the name (warp and weft are the vertical and horizontal threads in weaving). So clearly the concepts are closely related. However, a warp is not the same as a thread – not conceptually, and not at the hardware level.

While the threads in a warp generally proceed in lockstep, individual threads in a warp can be inactive, which allows the implementation of control-transfer constructs like if-then-else. The hardware maintains an active-thread mask that is manipulated at divergence and convergence points in the control flow. The handling of those points is hidden at the CUDA and PTX levels. You can see glimpses of it at the SASS level (less so on modern GPU architectures), for example in the form of SSY instructions.

Note that this active-thread mask is a mechanism orthogonal to predicated execution, which is basically a write-back inhibitor affecting individual instructions.

One could adopt a mental model that sees warp execution as thread execution on a (currently) 32-lane wide SIMD architecture with lane-masking capability, but I don’t see how this is helpful in understanding CUDA’s basic execution model as described in the documentation. That is why NVIDIA refers to the CUDA model as SIMT rather than SIMD. What a CUDA programmer deals with at the HLL level is for the most part single-thread execution, without having to worry about the structure of the underlying hardware, in stark contrast to classical SIMD, which exposes the hardware more (and thus often requires rewrites for new generations of SIMD instruction sets with increasing lane count: MMX, SSE, AVX, AVX512).

@njuffa: “One could adopt a mental model that sees warp execution as thread execution on a (currently) 32-lane wide SIMD architecture with lane-masking capability, but I don’t see how this is helpful in understanding CUDA’s basic execution model as described in documentation.”

I find it much more helpful to use that mental model (a warp is just a 32-lane SIMD thread), because it’s much closer to what’s happening under the hood, and so when I’m writing code that needs to be well-optimized, I can rely on this stronger mental model to work out how to speed things up.

I think the “warp is a bundle of threads” model/analogy tricks the programmer and ends up creating more costs than benefits. For example, with this mental model it’s not obvious why divergence/branching should be avoided. It also makes it hard to understand why GPUs need to be significantly over-provisioned with warps in order to reach maximum utilization. These things don’t make sense unless you understand what’s going on at the lower level.

@njuffa: “That is why NVIDIA refers to the CUDA model as SIMT rather than SIMD.”

https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads

CUDA      OpenCL       Hennessy & Patterson
----      ------       --------------------
Thread    Work-item    Sequence of SIMD Lane operations
Warp      Wavefront    Thread of SIMD Instructions
Block     Workgroup    Body of vectorized loop
Grid      NDRange      Vectorized loop

A couple of Turing award winners (Hennessy & Patterson) think that “Thread of SIMD Instructions” makes more sense than “warp”.

Perhaps Nvidia was just trying to go for an abstraction that would be robust to underlying hardware changes, and so they went with very “safe”, “abstract” terminology. Still, I think overloading the term “thread” was a bad idea.

I am reasonably sure that was a motivation. As I recall, GPUs with compute capability 1.x were actually organized by the half-warp at hardware level, and since Volta there are now separate program counters for each thread in a warp. I think that makes SIMT sufficiently different even from SIMD with lane masking.

As a practical consequence, from a programmer’s perspective, CUDA code looks (to first order) like scalar code, which is quite different from any SIMD code I have used. That is what makes CUDA so easy to use and thus powerful, in my not so humble opinion. From what I have seen, efforts to map CUDA-like code to classical SIMD architectures have not been very successful (as in, widely used). Even 25 years after MMX it is, for the most part, still hundreds of intrinsics to learn, except for the few cases autovectorizers are able to handle.

As for the pedigree of the warp and SIMT terminology, my recollection is that they were coined by the late John Nickolls (https://www.legacy.com/obituaries/mercurynews/obituary.aspx?n=john-nickolls&pid=153872993), who was one of the architects of the MasPar MP-1 (a first-generation massively parallel computer architecture of the 1980s) and later the key architecture person in adding compute support to NVIDIA GPUs.

If people prefer a different model of abstraction that is useful to them, they are entirely free to adopt it. For what it is worth, appeals to authority fall on deaf ears with me, I am afraid.

@njuffa: “appeals to authority fall on deaf ears with me, I am afraid.”

Rightly so! It was a bad argument. Though I suppose I was just trying to make the point that OP is not alone in finding this mental model useful.