That must be an unusual kernel. In the second case you have a severe case of arithmetic underutilization - only one SIMD lane per warp is used. The number of warps running per SM, however, is no smaller.
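To spell out the lane-utilization point with a sketch (the kernel here is hypothetical, just for illustration): warps are 32 lanes wide, and a block always occupies a whole number of warps, so a one-thread block still consumes a full warp.

```cuda
// Hypothetical kernel, purely for illustration.
__global__ void scale(float *data, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= k;
}

// <<<64, 1>>>: 64 blocks of 1 thread each. Every block is still issued
// as a full 32-lane warp with 31 lanes masked off, so the SIMD units
// run at ~1/32 throughput -- but there are 64 warps, and they can be
// spread across the SMs.
// <<<1, 64>>>: 1 block of 64 threads = 2 fully populated warps, but
// the single block is confined to one SM.
```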
I think this is a good question, and neither tera nor vvolkov really answered it.
I’m still struggling to really understand the architecture, and I don’t have the answer to your question, but let me work through it…
A block is executed by only one streaming multiprocessor. If you launch more threads in a block than the number of streaming processors in that multiprocessor, the excess threads will be held by the hardware threads of the streaming processors (I think they can still load data), but have to wait for their turn to be computed. If you launch more threads in a block than the total number of hardware threads across all the streaming processors of a streaming multiprocessor, I guess the excess threads are put in some sort of queue.
Likewise, if you launch more blocks than the number of streaming multiprocessors that you have on your card, I think the excess blocks are also put in a queue and wait for their turn.
Now, if I’m right, the fact that <<<1,64>>> is slower than <<<64,1>>> would mean that you have a lot of streaming multiprocessors, with fewer streaming processors inside each one. As a result, launching a lot of threads in one block creates a longer queue of threads waiting for their turn than launching a lot of blocks of only one thread…
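If it helps to experiment, here is a minimal self-contained test along the lines of what I imagine the original comparison looked like (the kernel body, sizes, and iteration count are my own guesses, not the original poster’s code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel: each thread does repeated arithmetic so that the
// launch shape, not memory traffic, dominates the timing.
__global__ void busywork(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    for (int k = 0; k < 10000; ++k)
        x = x * 1.0000001f + 0.5f;
    out[i] = x;
}

int main() {
    float *d;
    cudaMalloc(&d, 64 * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    // 64 blocks of 1 thread: each warp has only 1 of 32 lanes active,
    // but the blocks can be distributed across many multiprocessors.
    cudaEventRecord(t0);
    busywork<<<64, 1>>>(d);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("<<<64,1>>>: %f ms\n", ms);

    // 1 block of 64 threads: two fully populated warps, but the whole
    // launch runs on a single multiprocessor.
    cudaEventRecord(t0);
    busywork<<<1, 64>>>(d);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("<<<1,64>>>: %f ms\n", ms);

    cudaFree(d);
    return 0;
}
```

Which configuration wins on this toy will depend on the GPU, so I wouldn’t read too much into a single run.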
Actually, I don’t know.
I just tried…
I’d like to see some seasoned CUDA programmer answering this question!