Kernel Launch: number of blocks

Hi guys!

My thesis is about parallelizing matching algorithms on GPUs. I'm using CUDA for that, but I have a question: when I launch a kernel with 1 block of 2 threads, and then launch the same kernel with 2 blocks of 1 thread each, why is the execution time better in the second case? Is it because CUDA launches the two blocks simultaneously on two cores of an SM (Streaming Multiprocessor), or is there a different reason? Other configurations, such as b:1 t:4 versus b:4 t:1, behave the same way: it is faster with more blocks!

Other questions:

1. When I launch a kernel with only 16 threads, will SIMT (Single Instruction, Multiple Threads) create a warp of 16 threads, or a warp of 32 in which only 16 execute?

2. When I launch more than 512 threads, e.g. 2 blocks of 512 threads, on a single SM that can only execute 768 threads, what happens to the others? Are they put into a waiting queue and run once the necessary resources become available? And why do I get better results with 1024 threads than with 768? Is it because there isn't any context-switching overhead?

Thanks a lot, guys! I would really appreciate it if anyone could help me clarify this. Sorry for my English :'(

All of these are answered in the programming guide.

CUDA can run two blocks simultaneously, in two SMs.
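To compare the two launch configurations from the question yourself, something like the following sketch could be used. The kernel `busy`, the helper `time_launch`, and the iteration count are made up for illustration; they are not the original poster's code. It times each launch with CUDA events and makes no claim about which configuration wins on a given GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: every thread does the same fixed amount of arithmetic.
__global__ void busy(float *out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = static_cast<float>(tid);
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;
    out[tid] = x;
}

// Time a single launch of `busy` in milliseconds using CUDA events.
float time_launch(int blocks, int threads, float *d_out, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busy<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int iters = 1000000;    // arbitrary amount of work per thread
    float *d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));

    time_launch(1, 1, d_out, iters);   // warm-up launch, result ignored

    float one_block  = time_launch(1, 2, d_out, iters);  // b:1 t:2
    float two_blocks = time_launch(2, 1, d_out, iters);  // b:2 t:1
    printf("1 block x 2 threads : %.3f ms\n", one_block);
    printf("2 blocks x 1 thread : %.3f ms\n", two_blocks);

    cudaFree(d_out);
    return 0;
}
```

Repeating each measurement several times and averaging would give more trustworthy numbers than a single launch.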

CUDA can also run two threads simultaneously in a single SM, but when threads within a warp take different branches, due to “if” or other control flow statements, the warp executes the branches sequentially, not in parallel.
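As a rough illustration of the divergence case, here is a small sketch in which even and odd threads of the same warp take different branches. The kernel `divergent` and the helper `run_divergent` are hypothetical names chosen for this example. For a branch this small the compiler may use predication instead of real divergence, so treat it as a picture of the pattern rather than a benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Even and odd threads of one warp follow different paths.
// When the warp really diverges, the hardware runs one path with the
// other threads masked off, then the other path.
__global__ void divergent(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (tid % 2 == 0)
        out[tid] = tid * 2;     // path taken by even threads
    else
        out[tid] = tid + 100;   // path taken by odd threads
}

// Launch one block of n threads (n assumed to be at most 1024)
// and copy the results back to the host.
std::vector<int> run_divergent(int n)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    divergent<<<1, n>>>(d_out, n);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> r = run_divergent(32);   // exactly one warp
    printf("out[0]=%d out[1]=%d\n", r[0], r[1]);   // prints out[0]=0 out[1]=101
    return 0;
}
```

If the two threads in the 1-block, 2-thread launch follow different branches like this, they share one warp and their paths run one after the other, whereas two 1-thread blocks sit in separate warps.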

A warp has 32 threads. If there are fewer threads, the warp executes with some of the threads doing nothing.
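The 16-thread case can be sketched as below. `warps_per_block` and `record_lane` are illustrative names, and the default of 32 assumes the usual warp size (device code can read the actual value from the built-in `warpSize`). A block of 16 threads still occupies one full warp; only lanes 0 to 15 are active.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of warps needed for a block of `threads` threads,
// rounding up to a whole warp (warp size of 32 is assumed by default).
int warps_per_block(int threads, int warp_size = 32)
{
    return (threads + warp_size - 1) / warp_size;
}

// Each thread records its lane index within its warp.
__global__ void record_lane(int *lane)
{
    lane[threadIdx.x] = threadIdx.x % warpSize;
}

int main()
{
    const int threads = 16;
    int *d_lane = nullptr;
    cudaMalloc(&d_lane, threads * sizeof(int));

    record_lane<<<1, threads>>>(d_lane);

    int h_lane[threads];
    cudaMemcpy(h_lane, d_lane, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_lane);

    // prints: 16 threads -> 1 warp(s), last active lane = 15
    printf("%d threads -> %d warp(s), last active lane = %d\n",
           threads, warps_per_block(threads), h_lane[threads - 1]);
    return 0;
}
```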

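If you have more blocks than can be simultaneously run on the SMs, then some blocks wait, and only execute after some of the other blocks finish. In your example, two 512-thread blocks would need 1024 resident threads, which is more than the 768 the SM can hold, so one block runs first and the other starts when it finishes.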
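How many blocks fit on an SM at once depends on the device and on the kernel's resource usage, so it is easiest to ask the runtime. The sketch below assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (CUDA 6.5 or later); the kernel `dummy` and the helper `resident_blocks_per_sm` are placeholders for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own, since register and shared
// memory usage affect how many blocks can be resident.
__global__ void dummy(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Ask the runtime how many blocks of `block_size` threads of `dummy`
// can be resident on one SM at the same time.
int resident_blocks_per_sm(int block_size)
{
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, dummy, block_size, 0);
    return blocks;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    const int block_size = 512;
    int per_sm = resident_blocks_per_sm(block_size);

    printf("SMs: %d, max resident threads per SM: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    printf("Resident %d-thread blocks per SM: %d\n", block_size, per_sm);
    printf("Blocks that can run at once on the device: %d\n",
           per_sm * prop.multiProcessorCount);
    return 0;
}
```

Any blocks in the grid beyond that last number wait until a resident block finishes.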

Context switching is handled by the hardware and has effectively zero overhead. An SM time-slices between the active warps, but each warp has its own hardware state (instruction pointer, registers, etc.), so the switch takes no time.