Hi, I have a number of questions about CUDA, because I am probably too stupid to figure them out myself :)
Q1) The ‘CUDA Programming Guide’ describes the 8800 GTX as having 16 multiprocessors with 8 scalar SIMD SPs each. When a kernel executes, each thread block in the grid is assigned to one multiprocessor and uses its local on-chip memory. Blocks cannot communicate with each other and stay on the multiprocessor they were assigned to (which implies there is a local on-chip memory per multiprocessor, and that makes perfect sense). Why, then, does the ‘8800 Architecture Overview’ show only 8 clusters (of 16 SPs each) with local memory? Wouldn’t that suggest that two blocks CAN communicate?
Q2) As I understand it, you can only run one kernel at a time on the hardware?
Q3) I expect the multiprocessors to be able to follow different execution paths. (As I understand the CUDA docs, even different WARPS should be able to follow different execution paths EFFICIENTLY.) In other words, the kernel can BRANCH based on the WARP ID (= THREADID / WARPSIZE) without suffering from the LOCKSTEP of the SIMD array? Easily explained: one warp executes the ‘IF’ clause and the other warp executes the ‘ELSE’ clause, but we never execute both with an execution mask hiding the writes of the ‘not needed’ code.
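To make Q3 concrete, here is a minimal sketch of what I mean (kernel name `warpBranch` and the launch configuration are just made up for illustration):

```cuda
// Hypothetical sketch: branch at WARP granularity, so every thread in a
// given warp takes the SAME path and no execution mask should be needed.
__global__ void warpBranch(float *data)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;   // warp size is 32 on G80

    if (warpId % 2 == 0)
        data[tid] += 1.0f;   // even warps: all 32 threads take this path
    else
        data[tid] -= 1.0f;   // odd warps: all 32 threads take this path
}
```

Since the branch condition is uniform within each warp, I would expect no divergence penalty here.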
Q4) IF (and I say IF, because I am aware of the importance of UNIFORM streaming) I wanted to bypass this ‘only one kernel at a time’ limit, could I do the following? Since each block gets dedicated to a (SPMD) multiprocessor, you can look at a block as a data-parallel ‘subkernel’. Could I then implement different subkernel functionalities (i.e. write a number of IF statements in the kernel that branch on the BLOCK ID) to run in parallel on the CUDA architecture? Easy example: in one block I want to increment values, in the other I want to decrement. I write one CUDA kernel that checks the BLOCK ID with two IF clauses and, based on those IDs, starts incrementing or decrementing values. Since the G80 has more than 2 multiprocessors, the blocks will run in PARALLEL. So these different functionalities are physically distributed, ergo realizing true SPMD processing?
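The easy example from Q4 would look something like this (a sketch only; `incDec` and the block/thread counts are names and numbers I made up):

```cuda
// Hypothetical sketch: two different "subkernels" selected by block ID.
__global__ void incDec(int *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (blockIdx.x == 0)
        data[tid]++;   // "subkernel" 1: this whole block increments
    else
        data[tid]--;   // "subkernel" 2: this whole block decrements
}

// Launch with two blocks so each "subkernel" lands on its own
// multiprocessor: incDec<<<2, 256>>>(d_data);
```

Since the branch condition depends only on `blockIdx.x`, every warp in a block takes the same path, so there should be no divergence inside any warp.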
Q5) What if the ‘decrement subkernel’ from the previous question takes a lot longer than the incrementing one (let’s say we decrement 5 times and increment only once)? Is there a specific problem? Or will the GigaThread execution manager automatically start the next block (if there are more than two blocks of course, e.g. 100 blocks that increment and 100 blocks that decrement) whenever a multiprocessor becomes available?
Q6) Following this SPMD concept, different threads can use a different number of registers. Why then is the limit stated as (#threads * #registers/thread < MAX) instead of just the sum of all registers actually used? This suggests that every thread WILL get the same number of registers. What’s up with that, given that warps can follow different execution paths and all?
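For example, assuming the per-multiprocessor register file of 8192 registers that I have seen quoted for the G80 (the numbers below are just made-up illustrations):

```
threads_per_block * registers_per_thread <= registers_per_multiprocessor
       256        *         10           =  2560 <= 8192   (block fits)
       256        *         40           = 10240 >  8192   (block does NOT fit)
```

So the formula seems to reserve the per-thread maximum for EVERY thread, which is exactly what my question is about.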
Q7) I noticed that the WARP SIZE is 32. Since we have 8 SPs per multiprocessor, I expect there is hardware TIME MULTIPLEXING, so each SP processes a warp over 4 clocks, which conveniently EMULATES the 4-component vectors needed for graphics?
I think that is about it :P For the sake of clarity, could you please answer with A1, A2, etc.? :) Otherwise it will get messy. No problem if you only have an answer to a single question :) I am happy with all your comments.
Thanks in advance!