Kernel Launch: number of blocks

Tonas · May 21, 2009, 4:50pm

Hi guys!

My thesis is about parallelizing matching algorithms on GPUs. I'm using CUDA for that, but I have a question: when I launch a kernel with 1 block and 2 threads, and then launch the same kernel with 2 blocks and 1 thread, why is the execution time better in the second case? Is it because CUDA runs the two blocks simultaneously on two cores of an SM (Streaming Multiprocessor), or is there a different reason? It's the same with other configurations, such as b:1 t:4 versus b:4 t:1: more blocks is faster!

Other questions:

1: When I launch a kernel with only 16 threads, will SIMT (Single Instruction, Multiple Threads) create a warp of 16, or a warp of 32 in which only 16 threads execute?

2: When I launch more than 512 threads, e.g. 2 blocks of 512 threads, on a single SM that can only execute 768 threads, what happens to the others? Are they put in a waiting queue and run once the resources they need become available? And why do I get better results with 1024 threads than with 768? Is it because there is no context-switching overhead?

Thanks a lot, guys! I'd really appreciate it if anyone could help me clarify this. Sorry for my English :'(

Jamie_K · May 21, 2009, 5:49pm

All of these are answered in the programming guide.

CUDA can run two blocks simultaneously, in two SMs.
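One way to check this on your own hardware is to time both launch configurations with CUDA events. The sketch below is only a measurement harness: `busyKernel`, `timeLaunch`, and the iteration count are hypothetical stand-ins for the matching kernel from the question. Timings depend on the device and on what the kernel does, so no particular result is implied.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: every thread runs the same
// fixed-length loop so that a launch takes a measurable amount of time.
__global__ void busyKernel(float *out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = static_cast<float>(tid);
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;
    out[tid] = x;  // store the result so the loop is not optimized away
}

// Returns the elapsed time of one launch in milliseconds.
float timeLaunch(int blocks, int threads, float *dOut, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(dOut, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int iters = 1 << 20;
    float *dOut = nullptr;
    // Both configurations use exactly two threads, so two floats suffice.
    if (cudaMalloc(&dOut, 2 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    timeLaunch(1, 1, dOut, iters);  // warm-up launch, result discarded

    printf("1 block  x 2 threads: %.3f ms\n", timeLaunch(1, 2, dOut, iters));
    printf("2 blocks x 1 thread : %.3f ms\n", timeLaunch(2, 1, dOut, iters));

    cudaFree(dOut);
    return 0;
}
```

Build it with `nvcc`. Averaging several runs of each configuration gives more stable numbers than a single launch.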

CUDA can also run two threads simultaneously in a single SM, but when threads within a warp take different branches, due to “if” or other control flow statements, the warp executes the branches sequentially, not in parallel.
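To make the divergence point concrete, here is a minimal sketch with two kernels that do the same work but choose their branch differently: one splits on even/odd thread index (threads in the same warp disagree), the other splits per warp (all threads of a warp agree). All names (`splitByLane`, `splitByWarp`, `pathA`, `pathB`, `timeKernel`) are made up for illustration. For very small branch bodies the compiler may use predication instead of real branches, so treat any measured difference as indicative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branch predicates, callable from host and device.
__host__ __device__ inline bool splitByLane(int tid) { return tid % 2 == 0; }
__host__ __device__ inline bool splitByWarp(int tid, int warpSz) { return (tid / warpSz) % 2 == 0; }

// Two different code paths of similar cost.
__device__ float pathA(float x, int iters)
{
    for (int i = 0; i < iters; ++i) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float pathB(float x, int iters)
{
    for (int i = 0; i < iters; ++i) x = x * 0.9999f - 0.5f;
    return x;
}

// Neighbouring threads disagree, so each warp has to execute both paths.
__global__ void divergentKernel(float *out, int iters)
{
    int tid = threadIdx.x;
    if (splitByLane(tid)) out[tid] = pathA(1.0f, iters);
    else                  out[tid] = pathB(1.0f, iters);
}

// All threads of a warp agree, so each warp executes only one path.
__global__ void uniformKernel(float *out, int iters)
{
    int tid = threadIdx.x;
    if (splitByWarp(tid, warpSize)) out[tid] = pathA(1.0f, iters);
    else                            out[tid] = pathB(1.0f, iters);
}

// Times a single-block launch of the given kernel, in milliseconds.
float timeKernel(void (*kernel)(float *, int), float *dOut, int threads, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<1, threads>>>(dOut, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int threads = 64;  // two warps on hardware with 32-thread warps
    const int iters = 1 << 20;
    float *dOut = nullptr;
    if (cudaMalloc(&dOut, threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    timeKernel(uniformKernel, dOut, threads, iters);  // warm-up, discarded

    printf("split inside warps : %.3f ms\n", timeKernel(divergentKernel, dOut, threads, iters));
    printf("split between warps: %.3f ms\n", timeKernel(uniformKernel, dOut, threads, iters));

    cudaFree(dOut);
    return 0;
}
```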

A warp has 32 threads. If there are fewer threads, the warp executes with some of the threads doing nothing.
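For the 16-thread case from the question, the arithmetic can be sketched as follows. The helper names (`warpsPerBlock`, `idleLanes`, `countThreads`) are made up for this example, and the warp size of 32 is only a fallback in case no device can be queried.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warps needed for a block of `threads` threads (rounded up).
__host__ __device__ inline int warpsPerBlock(int threads, int warpSz)
{
    return (threads + warpSz - 1) / warpSz;
}

// Lanes of the last warp that have no thread assigned and so do nothing.
__host__ __device__ inline int idleLanes(int threads, int warpSz)
{
    return warpsPerBlock(threads, warpSz) * warpSz - threads;
}

// Each launched thread adds 1, so the total is the number of threads that ran.
__global__ void countThreads(int *count)
{
    atomicAdd(count, 1);
}

int main()
{
    int warpSz = 32;  // assumed fallback; replaced by the device's value below
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        warpSz = prop.warpSize;

    const int threads = 16;
    printf("%d threads, warp size %d: %d warp(s), %d idle lane(s)\n",
           threads, warpSz, warpsPerBlock(threads, warpSz), idleLanes(threads, warpSz));

    int *dCount = nullptr;
    if (cudaMalloc(&dCount, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dCount, 0, sizeof(int));
    countThreads<<<1, threads>>>(dCount);

    int hCount = 0;
    cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    printf("threads that actually ran: %d\n", hCount);

    cudaFree(dCount);
    return 0;
}
```

With a warp size of 32, a 16-thread block occupies one warp with 16 idle lanes, and the counter confirms that only 16 threads did any work.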

If you have more blocks than can be simultaneously run on the SMs, then some blocks wait, and only execute after some of the other blocks finish.
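As a rough sketch of that waiting behaviour, the code below first works through the numbers from the question (one SM, 768 threads per SM, 2 blocks of 512 threads) and then asks the runtime the same thing about the installed device. `dummyKernel`, `blocksPerSmByThreads`, `wavesNeeded`, and the 1000-block grid are hypothetical. The thread limit is only one constraint; registers and shared memory can lower the number of resident blocks further, which is why the device query uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (available in newer CUDA toolkits).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel, used only so the occupancy query has a target.
__global__ void dummyKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Resident blocks per SM if the thread limit were the only constraint.
inline int blocksPerSmByThreads(int maxThreadsPerSm, int threadsPerBlock)
{
    return maxThreadsPerSm / threadsPerBlock;
}

// Rounds of execution needed when only `resident` blocks can run at once.
inline int wavesNeeded(int totalBlocks, int resident)
{
    return (totalBlocks + resident - 1) / resident;
}

int main()
{
    // Numbers from the question: one SM, 768 threads per SM, 2 blocks of 512.
    int perSm = blocksPerSmByThreads(768, 512);
    printf("question's case: %d block(s) resident, %d wave(s)\n",
           perSm, wavesNeeded(2, perSm));

    // The same estimate for the device that is actually installed.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    int activePerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activePerSm, dummyKernel, 512, 0);

    int resident = activePerSm * prop.multiProcessorCount;
    const int totalBlocks = 1000;  // hypothetical grid size
    if (resident > 0)
        printf("this device: %d SM(s), %d block(s) of 512 resident, %d wave(s) for %d blocks\n",
               prop.multiProcessorCount, resident, wavesNeeded(totalBlocks, resident), totalBlocks);
    return 0;
}
```

For the question's case this gives one resident block and two waves: the second 512-thread block starts only after the first has finished.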

Context switching is handled by the hardware and has effectively zero overhead. An SM time-slices between the active warps, but each warp has its own hardware state (instruction pointer, registers, etc.), so the switch takes no time.
