A question about parallelization

Hi all,

This may be an old topic for you, but it is still not very clear to me.

We have multiprocessors in CUDA-enabled GPUs. Do all these multiprocessors run concurrently, or are they scheduled? I am aware that at most 768 threads can run on one multiprocessor. So if I have more than 768 threads, will they be split across several multiprocessors? Is it guaranteed that 768 threads run concurrently if I have assigned 768 threads to a kernel?

Let's say I limit the threads for one kernel to 768. Is there a way to run this kernel on several multiprocessors simultaneously? If so, how should I specify which multiprocessor a kernel runs on? If not, is there another way to achieve this?

Thank you so much!!!!

When calling a kernel you specify a grid size and a block size.
The block size determines how many threads there are per block; the grid size determines how many blocks you have. A block will run on 1 multiprocessor. An MP can run up to 8 blocks 'at the same time', and can handle at most 768 threads at the same time (1024 on GT200), but this depends heavily on resource usage.
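To make that concrete, here is a minimal sketch of such a launch (the kernel name, array size, and block size of 256 are just placeholder choices for illustration):

```cuda
#include <cuda_runtime.h>

// Each thread computes its global index from its block index
// and its thread index within the block.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: n need not be a multiple of blockDim.x
        data[i] *= factor;
}

int main(void)
{
    const int n = 4096;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    dim3 block(256);                        // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n elements
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Note that you only choose the grid and block dimensions; which multiprocessor each block lands on is decided by the hardware scheduler, not by you.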


Thanks for the reply, but it is still not exactly the answer I was expecting. I would like to know whether the MPs run simultaneously. Maybe my question was not clear enough. Is there a way to run several kernels concurrently?

Thanks again.

No, currently you cannot have more than 1 kernel running at the same time.

MPs do not "run". Blocks run (on MPs). A block runs on 1 MP, but an MP can run more than 1 block at the same time. You just tell the kernel how many threads per block you want and how many blocks. CUDA takes care of running all those blocks as efficiently as possible on the MPs. If resources permit, several blocks will run on 1 MP.
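As a rough sketch of that scheduling arithmetic, using the per-MP caps mentioned in this thread (768 threads and 8 blocks on G80-class hardware; real occupancy also depends on register and shared-memory usage):

```cuda
#include <stdio.h>

int main(void)
{
    const int threads_per_block = 256;   // example block size
    const int max_threads_per_mp = 768;  // G80-class limit from this thread
    const int max_blocks_per_mp  = 8;

    // How many whole blocks fit on one MP: take the tighter of the
    // thread-count limit and the block-count limit.
    int by_threads    = max_threads_per_mp / threads_per_block;
    int blocks_per_mp = by_threads < max_blocks_per_mp ? by_threads
                                                       : max_blocks_per_mp;

    // With 256-thread blocks: 3 blocks of 256 = 768 threads per MP.
    printf("blocks resident per MP: %d\n", blocks_per_mp);
    return 0;
}
```

So a grid of, say, 48 such blocks would keep 16 MPs fully occupied, with the scheduler placing blocks on MPs for you.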

Related to this discussion is what a warp is… Since your discussion of MP’s and threads makes sense to me, I wonder if you could comment on the accuracy of this paragraph below describing what a warp is and its purpose (this material mainly due to seibert - http://forums.nvidia.com/index.php?showtopic=57726 ) :

I assume that the warp strategy is why one wants to have many more threads than stream processors? (I am adding a new section to my introductory document, so I'm trying to make this discussion as clear as possible.)

(Also, I surmise that the term “warp” comes from the textile industry - the threads on a loom (and nothing to do with the Starship Enterprise :) ))

You want more threads than stream processors because:

  1. a warp is executed over 4 clock cycles, so you need at least 4 threads per ALU to fill a warp
  2. you need at least 6 warps (192 threads) per MP to hide read-after-write register dependencies
  3. you want to hide memory-access latency
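A small sketch of the arithmetic behind those points (the warp size of 32 and the 192-thread figure come from the list above; the block size of 256 is just an illustrative choice):

```cuda
#include <stdio.h>

int main(void)
{
    const int warp_size         = 32;   // threads per warp
    const int threads_per_block = 256;  // example block size
    const int min_threads_raw   = 192;  // 6 warps, point 2 above

    int warps_per_block = threads_per_block / warp_size;

    printf("warps per block: %d\n", warps_per_block);
    printf("enough warps to hide RAW latency: %s\n",
           threads_per_block >= min_threads_raw ? "yes" : "no");
    return 0;
}
```

In other words, a 256-thread block already supplies 8 warps per MP, which is why block sizes in the 128-256 range are a common starting point.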