CUDA on G80 hardware questions... Mapping the execution model to hardware

Hi, I have a number of questions about CUDA, because I am probably too stupid to figure them out :)


Q1:
The ‘CUDA Programming Guide’ describes the 8800 GTX as having 16 multiprocessors with 8 SIMD scalar SPs each, and says that when a kernel executes, each thread block in the grid is dedicated to one multiprocessor so it can use the local on-chip memory. Blocks cannot communicate with each other and stay on the multiprocessor they were assigned to (this implies that there is a local on-chip memory per multiprocessor, which makes perfect sense). Why, then, does the ‘8800 Architecture Overview’ show only 8 clusters (with 16 SPs each) with local memory? (This would suggest that two blocks can in fact communicate?)

Q2:
As I understand it, you can only run one kernel on the hardware at a time?

Q3:
I expect the multiprocessors to be able to follow different execution paths. (As far as I can tell from the CUDA docs, even different warps should be able to follow different execution paths efficiently.) In other words, branching inside the kernel based on warp ID (= threadID / warp size) should be possible without suffering from the lockstep of the SIMD array. Put simply: in one warp the ‘if’ clause gets executed, in the other warp the ‘else’ clause gets executed, but the two are not both executed with an execution mask hiding the writes of the code that is not needed.
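To make concrete what I mean, here is a minimal sketch (a hypothetical kernel of my own, assuming a warp size of 32):

    __global__ void warpBranch(float *data)
    {
        int i      = blockIdx.x * blockDim.x + threadIdx.x;
        int warpId = threadIdx.x / 32;   // 32 = warp size on G80

        // Every thread of a warp takes the same path, so no execution
        // mask / serialization should be needed here.
        if (warpId % 2 == 0)
            data[i] += 1.0f;             // even warps take the 'if' path
        else
            data[i] -= 1.0f;             // odd warps take the 'else' path
    }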

Q4:
IF (and I say IF, because I am aware of the importance of uniform streaming) I wanted to bypass this ‘only one kernel at a time’ limitation, can I do the following? Since each block gets dedicated to a (SPMD) multiprocessor, you can look at a block as being a data-parallel ‘subkernel’. Can I therefore implement different subkernel functionalities (i.e. write a number of if-statements in the kernel based on block ID) that run in parallel on the CUDA architecture? Easy example: in one block I want to increment values, in the other I want to decrement. I write a ‘CUDA kernel’ that checks the block ID with two if-clauses and, based on those IDs, starts incrementing or decrementing values. Since the G80 has more than 2 multiprocessors, the blocks will run in parallel. These different functionalities are then physically distributed, ergo realizing true SPMD processing?

Q5:
What if the ‘decrement subkernel’ from the previous question takes a lot longer than the incrementation (let’s say we decrement 5 times and increment only once)? Is that a problem? Or will the GigaThread execution manager automatically start a new block (if there are more than two blocks of course - e.g. 100 blocks that increment and 100 blocks that decrement) whenever a multiprocessor becomes available?

Q6:
Following this SPMD concept, different threads can use a different number of registers. Then why state the limit as (#threads * #regs/thread < MAX)? Why not simply the sum of all registers actually used? This would suggest that all threads get allocated the same number of registers. What’s up with that (with warps being able to follow different execution paths and all)?
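(Just to make the constraint concrete, with numbers I am only assuming: if MAX were 8192 registers per multiprocessor, a 256-thread block whose heaviest thread needs 32 registers would reserve 256 × 32 = 8192 registers - the whole register file - even if most of its threads never touch all 32.)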

Q7:
I noticed that the warp size is 32. Since we have 8 SPs per multiprocessor, I expect there is a hardware time multiplexer that emulates the 4-component vectors needed for graphics?


I think this is about it :P For the sake of clarity, could you please answer with A1, A2, etc… :) otherwise it will get messy… No problem if you only have an answer to a single question :) I am happy with any and all comments.

Thanks in advance!

A1:
I noticed this transposition as well in the Wikipedia article on CUDA. I think the CUDA documentation is more likely to be correct though: 16 multiprocessors with 8 processors in each multiprocessor.

Despite the 8 vs. 16 numerical confusion, two blocks still can’t communicate through local memory. With compute capability 1.1 GPUs (8500/8600), two blocks can communicate through the global memory.
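One way that communication works (just a sketch from me; global-memory atomics need compute capability 1.1, so not the 8800):

    __global__ void countDone(int *counter)
    {
        // One thread per block bumps a global counter; blocks
        // "communicate" only through this global memory location.
        if (threadIdx.x == 0)
            atomicAdd(counter, 1);
    }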

A2:
Correct. One kernel at a time.

A3:
Yes, this is mentioned in the manual. Branching on warp boundaries is fine.

A4:
Yes, this is sometimes called a “fat kernel.” The drawback to this method is that you must start and stop all tasks at the same time (since they are part of the same master kernel).
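A minimal sketch of such a “fat kernel”, using your increment/decrement example (hypothetical code, not from any NVIDIA sample):

    __global__ void fatKernel(float *a, float *b)
    {
        // Even-numbered blocks increment elements of a, odd-numbered
        // blocks decrement elements of b. Whole blocks follow one path,
        // so there is no divergence within a warp.
        int task = blockIdx.x / 2;
        int i    = task * blockDim.x + threadIdx.x;

        if (blockIdx.x % 2 == 0)
            a[i] += 1.0f;
        else
            b[i] -= 1.0f;
    }

Launched as, say, fatKernel<<<200, 256>>>(a, b), 100 blocks work on a and 100 on b, and the scheduler distributes them over the multiprocessors.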

A5:
I don’t know enough about the block scheduling to answer this. I suspect getting full utilization of the hardware with a fat kernel will be difficult no matter what.

A7:
The warp size is larger than the number of processors for two reasons. First, the stream processors are clocked higher than the GPU core clock (1.35 GHz vs. 575 MHz), so you want to run the same decoded instruction on multiple data elements. Second, instruction execution is pipelined: it takes two clocks to finish one instruction, but you can have multiple instructions “in flight” in the pipeline.

The stream processors operate on scalars natively, so the warp size is not intended to emulate a vector processor. The nvcc compiler lets you overload operators, so it is still possible to write code where you add float3 types and so on, but ultimately the compiler decomposes everything into scalar operations.
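For instance (a sketch only; the operator overload below is something you define yourself or pull in from a helper header, it is not part of the core CUDA headers):

    // User-defined overload, lowered to scalar adds by the compiler.
    __device__ float3 operator+(float3 a, float3 b)
    {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    }

    __global__ void addVec3(float3 *out, const float3 *u, const float3 *v)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Looks like a vector add, but it compiles to three independent
        // scalar adds on the scalar stream processors.
        out[i] = u[i] + v[i];
    }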

Thank you (for your time and effort)! This already confirms a lot of my ideas :-) (Other people are of course still free to add their views.)

But we still need the definitive answers to Q1, Q5 and Q6. Surely the guys from NVIDIA should easily know A1… anyone? ;-)

I am sorry, but I do not understand your remarks in A7. I get the shader clock / core clock discrepancy, and reusing the same decoded instruction multiple times, but I don’t understand your remark about ‘instruction execution is pipelined’. What do you mean by that?

Thanks again!

Q1: 16 multiprocessors, 8 SPs each.

Modern CPUs need more than one clock cycle to execute most instructions. However, if they break the execution into stages, they can have several instructions “in flight” at the same time. In a given clock cycle, one instruction can be at stage 1, and another at stage 2, and so on. One instruction can finish in every clock cycle, but a given instruction might take many clocks to finish from beginning to end. See http://en.wikipedia.org/wiki/Pipelining for more info.

Paulius, thanks for the confirmation, but then why does the architecture documentation show groups of ‘2 multiprocessors’ that share local on-chip memory? (This is not consistent with the CUDA execution model.) Thanks!

Not sure, can you give me a specific reference (with version and page numbers)?

Paulius

GeForce_8800_GPU_Architecture_Technical_Brief.pdf

(version nov 8 2006)

page number 13

They speak of a ‘Parallel Data Cache (PDC)’ per 16 SPs, and on the next page (p. 14) they show (and tell) how the ‘shared data’ is stored inside the PDC.


But I am beginning to suspect that the PDC is not the ‘shared memory’ but rather an L2 instruction & data cache, as shown on other diagrams, and that the shared memory is located a level closer to the SPs (and only available per 8 SPs). I have noticed that the combination of 2 multiprocessors is often called a Texture Processor Cluster (TPC), as shown here (slide 14).

So by now I am pretty sure that the ‘PDC’ is not the ‘shared mem’. If so, then what is this PDC? Is it the L2 I&D cache? And what is it used for, since global and local memory accesses aren’t cached? Thanks!

The PDC is the same as the shared memory. There are 16 multiprocessors with 8 scalar processors each. Each multiprocessor has 16KB of shared memory.
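A minimal illustration of how that per-multiprocessor shared memory shows up in CUDA code (a sketch only, assuming 256-thread blocks):

    __global__ void sharedDemo(const float *in, float *out)
    {
        // One such array is allocated per block out of the 16KB of
        // shared memory ("PDC") on the multiprocessor running the block.
        __shared__ float tile[256];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();

        // Threads of the same block can read each other's values;
        // threads of different blocks cannot.
        out[i] = tile[(threadIdx.x + 1) % 256];
    }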

GeForce_8800_GPU_Architecture_Technical_Brief.pdf is a marketing document intended for digestion by semi-technical press and consumers. It is therefore oversimplified. The CUDA programming guide is the technical document that you should use as a reference.

Mark

Okay, thanks for clearing that up. It is a little confusing that a ‘cache’ represents the shared memory. But to extend my previous post: what does this L2 I&D cache actually do (since global/local memory is not cached)?

Isn’t it that the technical brief speaks of 8 units with 16 SPs each because 2 MPs are grouped into a texture processing cluster? Like on slide 5 of these slides (from your chief scientist, not your chief marketeer :D ) http://courses.ece.uiuc.edu/ece498/al1/lec…%20hardware.ppt

It will make no difference to programming in CUDA in general, but it might be that threads of 2 blocks that happen to share the same texture processing cluster will have a higher cache-hit ratio if they access values close together?

And slide 7 speaks of 2 Super Function Units per MP; from later slides it looks like these perform the interpolation for texture fetches?

<Hmm, hadn’t seen the earlier reference to another set of slides from the same series of lectures. But they certainly are very interesting reading>