I am afraid I don’t quite understand the basic operation of a GPU when executing CUDA kernels as described in the Programming Guide. Chapter 3 describes the GPU implementation nicely, but in my opinion it contains some contradictory statements.
It says, for example, that:
- a block of threads is executed on a single multiprocessor (MP)
- an MP consists of 8 scalar processors (SPs)
- each thread is executed by one SP
- the threads are executed in groups of 32, called warps
- an MP can execute up to 8 blocks concurrently
The guide says:
“Every instruction issue time, the SIMT unit selects a warp that is ready to execute
and issues the next instruction to the active threads of the warp. A warp executes
one common instruction at a time, so full efficiency is realized when all 32 threads
of a warp agree on their execution path.”
From what I understand, only 8 threads of the same block can actually run concurrently in the strict sense of the word, and even then only if they don’t diverge (because a block is not divided among MPs). So the quoted sentence makes no sense to me. If my GPU consists of 12 MPs, then 12 * 8 = 96 threads actually run concurrently, but in the case of total divergence the number of concurrent threads drops to 12.
The meaning of warps is completely vague to me, because a warp is not the unit of concurrent execution but is used only for scheduling.
I hope someone can enlighten me about the true state of affairs. I would also recommend adding some sort of pseudocode for the thread/warp/block execution control to the guide, because it is the only real CUDA literature available (besides the reference manual and the tutorials, which do not cover this information).
Thank you in advance.