Single-thread blocks or a single block with more threads ...

Hi,

I have the following case:

kernel1<<<1,64>>>(param)

is slower than

kernel2<<<64,1>>>(param), where the two kernels are identical in operation.
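For concreteness, a minimal pair of such kernels might look like this (the kernel body is my assumption, since the original post doesn't show it; only the launch configurations match the post):

```cuda
#include <cuda_runtime.h>

// Stand-in kernel body: one value written per logical work item.
__global__ void kernel1(float *out) {
    int i = threadIdx.x;            // 1 block of 64 threads
    out[i] = i * 2.0f;
}

__global__ void kernel2(float *out) {
    int i = blockIdx.x;             // 64 blocks of 1 thread each
    out[i] = i * 2.0f;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 64 * sizeof(float));
    kernel1<<<1, 64>>>(d_out);      // all 64 threads land on one SM
    kernel2<<<64, 1>>>(d_out);      // blocks can spread across SMs
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```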

Why does the kernel with many single-thread blocks run faster than the kernel with a single block of many threads?

Thank you.

Neither of these approaches is able to utilize more than a fraction of your GPU's resources, wasting most of them.

However, kernel2 can utilize multiple SMs for execution, while kernel1 only uses a single one.
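You can see how many SMs are available to spread those 64 blocks over by querying the device (a minimal sketch using the CUDA runtime API; device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// <<<64,1>>> can distribute its 64 blocks across all SMs reported
// here, while <<<1,64>>> is pinned to a single SM for its lifetime.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SM count: %d\n", prop.multiProcessorCount);
    return 0;
}
```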

That must be an unusual kernel. In the second case you have severe arithmetic underutilization: only one SIMD lane per warp is used. The number of warps running per SM, however, is not smaller.

I think this is a good question, and neither tera nor vvolkov really answered it.
I’m still battling to really understand the architecture, and I don’t have the answer to your question, but let me develop it…

A block is executed by only one streaming multiprocessor. If you launch more threads in a block than there are streaming processors in that multiprocessor, I think the excess threads are held by the hardware threads of the streaming processors (they can probably still load data), but have to wait for their turn to be computed. If you launch more threads in a block than the total number of hardware threads across all the streaming processors of a streaming multiprocessor, I guess the excess threads are put in some sort of queue.

Likewise, if you launch more blocks than the number of streaming multiprocessors that you have on your card, I think the excess blocks are also put in a queue and wait for their turn.

Now, if I’m right, the fact that <<<1,64>>> is slower than <<<64,1>>> would mean that you have many streaming multiprocessors, each with fewer streaming processors inside. As a result, launching many threads in one block creates a longer queue of work waiting for its turn than launching many blocks of only one thread each…

Actually, I don’t know.
I just tried…
I’d like to see some seasoned CUDA programmer answering this question!

Because threads in different blocks run asynchronously, while threads in the same block are synchronized. It is quite common.
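To illustrate the block-level synchronization mentioned above, here is a small sketch (the kernel and its data layout are made up for illustration): __syncthreads() is a barrier for the threads of one block only; there is no equivalent barrier across blocks within a single kernel launch.

```cuda
#include <cuda_runtime.h>

__global__ void blockBarrier(int *data) {
    int i = threadIdx.x;
    data[i] = i;
    __syncthreads();              // every thread of THIS block waits here
    if (i > 0)
        data[i] += data[i - 1];   // safe: data[i-1] was written before the barrier
}

int main() {
    int *d_data;
    cudaMalloc(&d_data, 64 * sizeof(int));
    blockBarrier<<<1, 64>>>(d_data);  // barrier spans all 64 threads
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```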

Note, however, that <<<64,1>>> often costs the same as <<<64,32>>>, because you can’t have less than 1 warp per block.