Simple questions, hard-to-find answers

Hello there. I’m learning CUDA, and I cannot find the answers to some simple questions.

First, is a block of threads assigned to each core or to each multiprocessor? For instance, I have a GTX 465. It has 11 multiprocessors with 32 cores each, which gives 352 cores. Would it be better to have eleven blocks or 352? There’s an example of this in the Programming Guide, but it doesn’t say anything about how MPs relate to all this.

Second, if I call two kernels, one after the other, do they synchronize automatically, or do I have to call cudaThreadSynchronize()?

Ex:

kernel1<<<nblocks, nthreads>>>(var1, var2, var3); // say that var3 is changed here

// Do I have to put cudaThreadSynchronize() here?

kernel2<<<nblocks, nthreads>>>(var1, var3, var4, var5);

Notice that there’s a dependency here.

Thanks in advance.

A block runs on a multiprocessor; a thread runs on a core of a multiprocessor.
It is generally good practice to oversubscribe the multiprocessors, either by running large blocks (with many threads, say 256 or 512) or by running many smaller blocks. In the end you want something like 8 warps (of 32 threads each) per multiprocessor, so with your 11 multiprocessors you would start about 11 * 8 * 32 = 2816 threads total (roughly 3000), distributed over at least 11 blocks. It is also advisable to have either many more blocks than there are multiprocessors, or a multiple of the number of multiprocessors.

Kernels that are not explicitly launched in different “streams” (and in your example they are not) do not execute in parallel. A kernel has to wait until an already running kernel has finished, so there is an implicit synchronization between your two kernels, just before the second one starts.
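Concretely, for your example (a sketch; the kernels and variables stand in for your real code, and cudaDeviceSynchronize() is the newer name for cudaThreadSynchronize() in recent toolkits):

```cuda
kernel1<<<nblocks, nthreads>>>(var1, var2, var3);       // writes var3

// No synchronization needed here: both launches go into the default
// stream, so kernel2 cannot start until kernel1 has finished.

kernel2<<<nblocks, nthreads>>>(var1, var3, var4, var5); // reads var3

// An explicit synchronization is only needed when the *host* has to
// wait for the device, e.g. before timing with a CPU timer.
// (A blocking cudaMemcpy synchronizes implicitly.)
cudaDeviceSynchronize();
```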

Ceearem

So, to achieve maximum performance I should have at least 2816 threads, distributed over 11*n blocks, where n is a natural number? And are the threads automatically divided into warps?