A question about parallelization

Hi all,

This may be an old topic for you, but it is still not very clear to me.

We have multiprocessors in CUDA-enabled GPUs. Do all these multiprocessors run concurrently, or are they scheduled? I am aware that at most 768 threads can run on one multiprocessor. So if I have more than 768 threads, will they be split across several multiprocessors? And if I assign 768 threads to a kernel, is it guaranteed that all 768 threads run concurrently?

Let's say I limit the threads for one kernel to 768. Is there a way to run this kernel on several multiprocessors simultaneously? If so, how do I specify which multiprocessor a kernel runs on? If not, is there another way to achieve this?

Thank you so much!!!!

When calling a kernel you specify a grid size and a block size.
The block size determines how many threads there are per block; the grid size determines how many blocks you have. A block runs on one multiprocessor. An MP can run up to 8 blocks ‘at the same time’, and can handle at most 768 threads concurrently (1024 on GT200), but this depends heavily on resource usage.
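
To make that concrete, here is a minimal sketch of a kernel launch (the kernel `scale` and the sizes are placeholders for illustration, not from this thread). The `<<<grid, block>>>` syntax is where you set the number of blocks and the threads per block; you never name a particular multiprocessor:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 4096;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Block size = threads per block, grid size = number of blocks.
    // 256 threads/block and 16 blocks here; the hardware decides which
    // multiprocessor each block actually runs on.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```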

Denis,

Thanks for the reply, but it is still not exactly the answer I was expecting. I would like to know whether the MPs run simultaneously. Maybe my question was not clear enough. Is there a way to run several kernels concurrently?

Thanks again.

No, currently you cannot have more than 1 kernel running at the same time.

MPs do not run; blocks run (on MPs). A block runs on one MP, but an MP can run more than one block at the same time. You just tell the kernel how many threads per block you want and how many blocks. CUDA takes care of running all those blocks on the MPs in the most efficient way. If resources permit, several blocks will run on one MP.
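
If you want to see how many MPs the scheduler has to spread your blocks over, you can query the device. A small sketch using the runtime API (the field names are from the real cudaDeviceProp struct):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Launch at least this many blocks (ideally several times more)
    // so that every multiprocessor has work to do.
    printf("multiprocessors (MPs) : %d\n", prop.multiProcessorCount);
    printf("warp size             : %d\n", prop.warpSize);
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```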

Related to this discussion is what a warp is… Since your discussion of MPs and threads makes sense to me, I wonder if you could comment on the accuracy of the paragraph below describing what a warp is and its purpose (this material is mainly due to seibert - http://forums.nvidia.com/index.php?showtopic=57726):

I assume that the warp strategy is why one wants to have many more threads than stream processors? (I am adding a new section to my introductory document, so I’m trying to make this discussion as clear as possible.)

(Also, I surmise that the term “warp” comes from the textile industry - the threads on a loom (and nothing to do with the Starship Enterprise :) ))

You want more threads than stream processors because:

  1. you need at least 4 threads per ALU to fill a warp
  2. you need at least 6 warps (192 threads) to avoid read-after-write register dependencies
  3. you want to hide memory-access latency
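
If it helps, here is the back-of-the-envelope arithmetic behind points 1 and 2, assuming the G80-era figures quoted earlier in the thread (8 stream processors per MP, 32-thread warps, 768 resident threads per MP):

```cuda
#include <cstdio>

int main()
{
    // Assumed G80-era figures from this thread, not queried from hardware.
    const int sps_per_mp     = 8;    // stream processors (ALUs) per MP
    const int warp_size      = 32;   // threads per warp
    const int min_warps      = 6;    // warps needed to hide register latency
    const int max_threads_mp = 768;  // max resident threads per MP

    printf("cycles to issue one warp over %d SPs : %d\n",
           sps_per_mp, warp_size / sps_per_mp);        // 4 -> "4 threads per ALU"
    printf("minimum threads per MP (%d warps)    : %d\n",
           min_warps, min_warps * warp_size);          // 192 threads
    printf("maximum resident warps per MP        : %d\n",
           max_threads_mp / warp_size);                // 24 warps
    return 0;
}
```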