That must be an unusual kernel. In the second case you have a severe case of arithmetic underutilization - only one SIMD lane per warp is used. The number of warps running per SM, however, is no smaller.
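To spell out the lane-utilization point with a sketch (the kernel here is hypothetical, just for illustration): warps are 32 lanes wide, and a block always occupies a whole number of warps, so a one-thread block still consumes a full warp.

```cuda
// Hypothetical kernel, purely for illustration.
__global__ void scale(float *data, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= k;
}

// <<<64, 1>>>: 64 blocks of 1 thread each. Every block is still issued
// as a full 32-lane warp with 31 lanes masked off, so the SIMD units
// run at ~1/32 throughput -- but there are 64 warps, and they can be
// spread across the SMs.
// <<<1, 64>>>: 1 block of 64 threads = 2 fully populated warps, but
// the single block is confined to one SM.
```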
I think this is a good question, and neither tera nor vvolkov really answered it.
I’m still struggling to really understand the architecture, and I don’t have the answer to your question, but let me work through it…
A block is executed by only one streaming multiprocessor. If you launch more threads in a block than the number of streaming processors in that multiprocessor, the excess threads will be held by the hardware threads of the streaming processors (I think they can still load data), but have to wait for their turn to be computed. If you launch more threads in a block than the total number of hardware threads across all the streaming processors of a streaming multiprocessor, I guess the excess threads are put in some sort of queue.
Likewise, if you launch more blocks than the number of streaming multiprocessors that you have on your card, I think the excess blocks are also put in a queue and wait for their turn.
Now, if I’m right, the fact that <<<1,64>>> is slower than <<<64,1>>> would mean that you have a lot of streaming multiprocessors, with fewer streaming processors inside each one. As a result, launching a lot of threads in one block creates a longer queue of threads waiting for their turn than launching a lot of blocks of only one thread…
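If it helps to experiment, here is a minimal self-contained test along the lines of what I imagine the original comparison looked like (the kernel body, sizes, and iteration count are my own guesses, not the original poster’s code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel: each thread does repeated arithmetic so that the
// launch shape, not memory traffic, dominates the timing.
__global__ void busywork(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    for (int k = 0; k < 10000; ++k)
        x = x * 1.0000001f + 0.5f;
    out[i] = x;
}

int main() {
    float *d;
    cudaMalloc(&d, 64 * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    // 64 blocks of 1 thread: each warp has only 1 of 32 lanes active,
    // but the blocks can be distributed across many multiprocessors.
    cudaEventRecord(t0);
    busywork<<<64, 1>>>(d);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("<<<64,1>>>: %f ms\n", ms);

    // 1 block of 64 threads: two fully populated warps, but the whole
    // launch runs on a single multiprocessor.
    cudaEventRecord(t0);
    busywork<<<1, 64>>>(d);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("<<<1,64>>>: %f ms\n", ms);

    cudaFree(d);
    return 0;
}
```

Which configuration wins on this toy will depend on the GPU, so I wouldn’t read too much into a single run.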
Actually, I don’t know.
I just tried…
I’d like to see some seasoned CUDA programmer answering this question!