(disclaimer: I’ve read chapters 2 and 3 a few times each… still having trouble understanding the details)
I’m trying to get a good handle on how the programming interface maps to the hardware and how the hardware executes programs/generally does its work.
1 - Reconciling the programming model with the execution model is giving me some difficulty, and I think much of that difficulty comes from not entirely understanding how “warp” and “block” are related. It seems that a warp is a subset of a block, but then why is the term “half-warp” important? I assume this is related to how warps/half-warps are scheduled for execution on the processors. Is there a different or more in-depth explanation of how work gets scheduled onto the device, or of which parts of the programming model (or of a piece of code) correspond to blocks, and of how blocks and warps are related? …maybe with an example.
I think I’d probably have a better idea of things if I could see a program and see exactly how work was mapped onto a specific card.
2 - For example, I have a G92-based 8800 GTS 512MB. According to the website, the device has 128 stream processors. I thought I read somewhere that it has 16 multiprocessors, but maybe I’m wrong about that… that would, according to Figure 3-1 in the Programming Guide, imply that I have 8 “Processors” per “Multiprocessor”?
3 - I would assume at least that all Processors within a Multiprocessor must be executing the same kernel at the same time, but do all Multiprocessors also have to be executing the same kernel at the same time?
Section 3.2 of the Programming Guide seems particularly dense to me. I’m pretty sure that with more work, looking through code examples, and reading the forum, I’ll get it, but after the week I’ve spent looking at CUDA, it would be my #1 candidate for expansion and deeper explanation with examples.