Simple questions, hard-to-find answers

Hello there. I’m learning CUDA, and I cannot find the answers to some simple questions.

First, is a block of threads assigned to each core or to each multiprocessor? For instance, I have a GTX 465. It has 11 multiprocessors with 32 cores each, which gives 352 cores. Would it be better to have eleven blocks or 352? There’s an example of this in the Programming Guide, but it doesn’t say anything about how MPs relate to all this.

Second, if I call two kernels, one after the other, do they synchronize automatically, or do I have to call cudaThreadSynchronize()?

Ex:

kernel1<<<nblocks, nthreads>>>(var1, var2, var3); // say that var3 is changed here

// Do I have to put cudaThreadSynchronize() here?

kernel2<<<nblocks, nthreads>>>(var1, var3, var4, var5);

Notice that there’s a dependency here.

Thanks in advance.

A block runs on a multiprocessor; a thread runs on a core of a multiprocessor.
It is generally good practice to oversubscribe the multiprocessors, either by running large blocks (with many threads, say 256 or 512) or by running many smaller blocks. In the end you want something like 8 warps (of 32 threads each) per multiprocessor, so with your 11 multiprocessors you would start about 11 * 8 * 32 = 2816 threads total (roughly 3000), distributed over at least 11 blocks. It is also advisable to have either many more blocks than there are multiprocessors, or a multiple of the number of multiprocessors.

Kernels that are not explicitly launched in different “streams” (and in your example they are not) do not execute in parallel. A kernel has to wait until an already running kernel has finished, so there is an implicit synchronization between your two kernels, just before the second one starts.
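Concretely, for your example (a sketch; the kernels and variables stand in for your real code, and cudaDeviceSynchronize() is the newer name for cudaThreadSynchronize() in recent toolkits):

```cuda
kernel1<<<nblocks, nthreads>>>(var1, var2, var3);       // writes var3

// No synchronization needed here: both launches go into the default
// stream, so kernel2 cannot start until kernel1 has finished.

kernel2<<<nblocks, nthreads>>>(var1, var3, var4, var5); // reads var3

// An explicit synchronization is only needed when the *host* has to
// wait for the device, e.g. before timing with a CPU timer.
// (A blocking cudaMemcpy synchronizes implicitly.)
cudaDeviceSynchronize();
```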

Ceearem

So, to achieve maximum performance I should have at least 2816 threads, distributed over 11*n blocks, where n is a natural number? And are the threads automatically divided into warps?