(disclaimer: I’ve read chapters 2 and 3 both a few times…still having trouble understanding the details)
I’m trying to get a good handle on how the programming interface maps to the hardware and how the hardware executes programs/generally does its work.
1 - Reconciling the programming model against the execution model is giving me some difficulty, and I think a good deal of the difficulty is that I don’t entirely understand how “warp” and “block” are related. It seems that a warp is a subset of a block, but then why is the term “half-warp” important? It must be related to how warps/half-warps are scheduled for execution on the processors. Is there perhaps a different or more in-depth explanation of how work gets scheduled onto the device, or of which parts of the programming model or of a piece of code are blocks, or of how blocks and warps are related? …maybe with an example
I think I’d probably have a better idea of things if I could see a program and see exactly how work was mapped onto a specific card.
2 - For example, I have a G92-based 8800 GTS 512MB. According to the website, the device has 128 stream processors. I thought I read somewhere that it has 16 multiprocessors, but maybe I’m wrong about that… that would, according to Figure 3-1 in the Programming Guide, imply that I have 8 “Processors” per “Multiprocessor”?
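(Side note: the multiprocessor count doesn’t have to be guessed from spec sheets; it can be queried at runtime. A minimal sketch using the runtime API’s cudaGetDeviceProperties, assuming device 0:)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: %d multiprocessors\n", prop.name, prop.multiProcessorCount);
    // 16 multiprocessors * 8 stream processors each = the advertised 128
    return 0;
}
```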
3 - I would assume at least that all Processors within a Multiprocessor must be executing the same kernel at the same time, but do all Multiprocessors also have to be executing the same kernel at the same time?
Section 3.2 of the Programming Guide seems particularly dense to me. I’m pretty sure with more work and looking through code examples and reading the forum, I’ll get it, but after looking at CUDA for the week that I have, it would be my #1 candidate for expansion/deeper explanation with examples.
all multiprocessors are executing the same kernel at the same time.
they do not need to execute the same code path, though (you can, e.g., do a switch(blockIdx.x) and execute different code based on block index)
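For instance, a minimal sketch (kernel and variable names invented for illustration) of per-block divergence via a switch on blockIdx.x:

```cpp
__global__ void perBlockWork(float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (blockIdx.x % 2) {
        case 0:  a[i] = b[i] * 2.0f; break;  // even-numbered blocks do one thing
        default: a[i] = b[i] + 1.0f; break;  // odd-numbered blocks do another
    }
}
```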
- a multiprocessor has 8 ALUs and 1 instruction decoder that runs at 1/4 of the ALU clock. So all 8 ALUs perform the same instruction 4 times in a row: 4 x 8 = 32 = one warp.
a block runs on a single multiprocessor (but more than 1 block can be running ‘at the same time’ on a multiprocessor)
all threads in a block run 32 at a time (0-31, 32-63, etc.)
I think (but cannot remember) that a half-warp has to do with memory banks. You can search the forum; some people from NVIDIA have explained the significance of half-warps in the past.
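If it helps to see the grouping in code, here is a hedged sketch; the bank comment reflects my understanding of compute capability 1.x hardware (G80/G92), where shared memory has 16 banks and conflicts are checked per half-warp:

```cpp
// launch with up to 256 threads per block for this illustration
__global__ void grouping(float *out) {
    int warp     = threadIdx.x / 32;  // threads 0-31 are warp 0, 32-63 are warp 1, ...
    int halfWarp = threadIdx.x / 16;  // the 16-thread unit the memory system works with

    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;  // consecutive threads hit consecutive
    __syncthreads();                         // 32-bit words, i.e. distinct banks,
    out[threadIdx.x] = tile[threadIdx.x]     // so a half-warp reads conflict-free
                     + (float)(warp + halfWarp);
}
```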
Are the following 2 points of information related? That is, are the 32 threads in a warp the same as the 32 threads that run at a time from a block?
Also, where did you learn the details about the instruction decoder and the 8 ALUs? I think this is the type of information I’m looking for, or at least some of it.
They talk about the instruction unit running at 1/2 the speed of the ALUs, though… but I think they may be talking about G80-specific stuff, so I’m not sure. I’ll take it that the instruction decoder runs at some fraction of the ALU clock… I think this is starting to make more sense, but there are still a lot of details.
Right, part of the reason that a warp is 32 threads while a multiprocessor only has 8 ALUs is that different clocks are used in different parts of the chip. The details are a little fuzzy since we don’t have super-precise hardware documentation, but basically, the ALUs run much faster than the instruction decoder, so you want to reuse a decoded instruction for many threads. This apparently works out to a 4:1 ratio on the current chips, and so the warp size is 32. The warp scheduler decodes the instruction for the warp, and then packs the threads of the warp into the ALU pipeline. This gives it time to decode the next instruction for the next warp.
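To make the arithmetic concrete (keeping in mind the 4:1 figure is inferred, not documented): 8 ALUs x 4 fast clocks per decoded instruction = 32 threads sharing one instruction, i.e. one warp. A block’s warp count then falls out directly:

```cpp
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + 31) / 32;  // a partial warp still occupies a full warp slot
}
```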
If your block has > 32 threads, then the next warp might be running the same instruction as the previous one, or not, if you branch in your kernel. Or, if all of the threads in one block are waiting for global memory reads to finish, the warp scheduler will, assuming you have enough register/shared memory resources, switch to a warp in another block, which can be at a completely different instruction. (Note that blocks are never swapped in or out of a multiprocessor while they are running. The kernel launcher figures out how many simultaneous blocks per multiprocessor you can run, given your register and shared memory usage.)
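As an illustration (kernel name and parameters invented), here is the kind of kernel where that warp switching pays off; while one warp stalls on the global read, the scheduler can issue ready warps, possibly from a different block at a completely different instruction:

```cpp
__global__ void scaleInPlace(float *data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];  // long-latency global read: this warp stalls here
        data[i] = v * s;    // meanwhile other resident warps keep the ALUs busy
    }
}
```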