(disclaimer: I’ve read chapters 2 and 3 both a few times…still having trouble understanding the details)
I’m trying to get a good handle on how the programming interface maps to the hardware and how the hardware executes programs/generally does its work.
1 - Reconciling the programming model against the execution model is giving me some difficulty, and I think a good deal of the difficulty is that I don’t entirely understand how “warp” and “block” are related. It seems that a warp is a subset of a block, but then why is the term “half-warp” important? It must be related to how warps/half-warps are scheduled for execution on the processors. Is there perhaps a different or more in-depth explanation of how work gets scheduled onto the device, or of which parts of the programming model or of a piece of code are blocks, or of how blocks and warps are related? …maybe with an example
I think I’d probably have a better idea of things if I could see a program and see exactly how work was mapped onto a specific card.
2 - For example, I have a G92-based 8800 GTS 512MB. According to the website, the device has 128 stream processors. I thought I read somewhere that it has 16 multiprocessors, but maybe I’m wrong about that… that would, according to Figure 3-1 in the Programming Guide, imply that I have 8 “Processors” per “Multiprocessor”?
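(Side note: the multiprocessor count doesn’t have to be guessed from spec sheets; it can be queried at runtime. A minimal sketch using the runtime API’s cudaGetDeviceProperties, assuming device 0:)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: %d multiprocessors\n", prop.name, prop.multiProcessorCount);
    // 16 multiprocessors * 8 stream processors each = the advertised 128
    return 0;
}
```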
3 - I would assume at least that all Processors within a Multiprocessor must be executing the same kernel at the same time, but do all Multiprocessors also have to be executing the same kernel at the same time?
Section 3.2 of the Programming Guide seems particularly dense to me. I’m pretty sure with more work and looking through code examples and reading the forum, I’ll get it, but after looking at CUDA for the week that I have, it would be my #1 candidate for expansion/deeper explanation with examples.
all multiprocessors are executing the same kernel at the same time.
they do not need to execute the same code path, though (you can, e.g., do a switch(blockIdx.x) and execute different code based on block index)
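For instance, a minimal sketch (kernel and variable names invented for illustration) of per-block divergence via a switch on blockIdx.x:

```cpp
__global__ void perBlockWork(float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (blockIdx.x % 2) {
        case 0:  a[i] = b[i] * 2.0f; break;  // even-numbered blocks do one thing
        default: a[i] = b[i] + 1.0f; break;  // odd-numbered blocks do another
    }
}
```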
- a multiprocessor has 8 ALUs and 1 instruction decoder that runs at 1/4 of the ALU clock. So all 8 ALUs perform the same instruction 4 times in a row: 4 x 8 = 32 = one warp.
a block runs on a single multiprocessor (but more than 1 block can be running ‘at the same time’ on a multiprocessor)
all threads in a block run 32 at a time (0-31, 32-63, etc.)
I think (but cannot remember) that a half-warp has to do with memory banks. You can search the forum; some people from NVIDIA have explained the significance of half-warps in the past.
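If it helps to see the grouping in code, here is a hedged sketch; the bank comment reflects my understanding of compute capability 1.x hardware (G80/G92), where shared memory has 16 banks and conflicts are checked per half-warp:

```cpp
// launch with up to 256 threads per block for this illustration
__global__ void grouping(float *out) {
    int warp     = threadIdx.x / 32;  // threads 0-31 are warp 0, 32-63 are warp 1, ...
    int halfWarp = threadIdx.x / 16;  // the 16-thread unit the memory system works with

    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;  // consecutive threads hit consecutive
    __syncthreads();                         // 32-bit words, i.e. distinct banks,
    out[threadIdx.x] = tile[threadIdx.x]     // so a half-warp reads conflict-free
                     + (float)(warp + halfWarp);
}
```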
Are the following 2 points of information related? That is, are the 32 threads in a warp the same as the 32 threads that run at a time from a block?
Also, where did you learn the details about the instruction decoder and the 8 ALUs? I think this is the type of information I’m looking for, or at least some of it.
They talk about the instruction unit running at 1/2 the speed of the ALUs, though… but I think they may be talking about G80-specific stuff, so I’m not sure. I’ll take it that the instruction decoder runs at some fraction of the ALU clock… I think this is starting to make more sense, but there are still a lot of details.
Right, part of the reason that a warp is 32 threads while a multiprocessor only has 8 ALUs is that different clocks are used in different parts of the chip. The details are a little fuzzy since we don’t have super-precise hardware documentation, but basically, the ALUs run much faster than the instruction decoder, so you want to reuse a decoded instruction for many threads. This apparently works out to a 4:1 ratio on the current chips, and so the warp size is 32. The warp scheduler decodes the instruction for the warp, and then packs the threads of the warp into the ALU pipeline. This gives it time to decode the next instruction for the next warp.
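To make the arithmetic concrete (keeping in mind the 4:1 figure is inferred, not documented): 8 ALUs x 4 fast clocks per decoded instruction = 32 threads sharing one instruction, i.e. one warp. A block’s warp count then falls out directly:

```cpp
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + 31) / 32;  // a partial warp still occupies a full warp slot
}
```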
If your block has > 32 threads, then the next warp might be running the same instruction as the previous one, or not, if you branch in your kernel. Or, if all of the threads in one block are waiting for global memory reads to finish, the warp scheduler will, assuming you have enough register/shared memory resources, switch to a warp in another block, which can be at a completely different instruction. (Note that blocks are never swapped in or out of a multiprocessor while they are running. The kernel launcher figures out how many simultaneous blocks per multiprocessor you can run, given your register and shared memory usage.)
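As an illustration (kernel name and parameters invented), here is the kind of kernel where that warp switching pays off; while one warp stalls on the global read, the scheduler can issue ready warps, possibly from a different block at a completely different instruction:

```cpp
__global__ void scaleInPlace(float *data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];  // long-latency global read: this warp stalls here
        data[i] = v * s;    // meanwhile other resident warps keep the ALUs busy
    }
}
```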