Nvidia’s programming guide mentions “the maximum number of active blocks per multiprocessor”, which is 8 for my G92. But what does “active block” mean? I thought there was always only ONE block per multiprocessor, with the other blocks from the grid “waiting” for a multiprocessor to become idle. Please explain :)
No, no. There can be several blocks per multiprocessor. The rest do wait, though.
Active blocks are blocks currently getting time slices from a multiprocessor. On hardware like your G92, a multiprocessor can rapid-fire switch between up to 24 resident warps, which can come from as many as 8 blocks (fewer if the shared memory and register usage of each block does not allow that many). Inactive blocks have to wait for an active block to finish before they get any GPU time.
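To put numbers on the warp limit, here is a minimal sketch of the arithmetic. It assumes the figures quoted in this thread (24 warps and 8 blocks per multiprocessor) plus a warp size of 32 threads; the function names are made up for illustration and are not part of any CUDA API.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits for G92-class hardware, as quoted above.
constexpr int kWarpSize       = 32;  // threads per warp
constexpr int kMaxWarpsPerMP  = 24;  // resident warps per multiprocessor
constexpr int kMaxBlocksPerMP = 8;   // active blocks per multiprocessor

// Hypothetical helper: warps needed by one block (a partial warp still counts as one).
inline int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helper: how many blocks fit when only the warp and block
// limits are considered (registers and shared memory are ignored here).
inline int blocks_allowed_by_warps(int threads_per_block) {
    if (threads_per_block <= 0) return 0;
    return std::min(kMaxBlocksPerMP,
                    kMaxWarpsPerMP / warps_per_block(threads_per_block));
}
```

For example, 256 threads per block means 8 warps per block, so at most 3 blocks fit under the 24-warp limit, while 64-thread blocks are capped by the 8-block limit instead.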
Well, OK. Suppose there are 1000 blocks, so 8 of them reside on each multiprocessor, and one of them is being computed while the other seven are waiting? Are blocks executed serially on each multiprocessor, so that only one block at a time occupies registers, shared memory, etc.?
No, blocks are not executed serially on an MP. This is what everyone has been saying.
If each line is a clock tick, this would be a typical instruction stream that the MP processes:
Block 0 warp 3
Block 3 warp 0
Block 2 warp 1
Block 0 warp 1
Block 1 warp 3
Block 2 warp 2
All warps in all blocks resident on the MP are actively consuming registers. This is all documented in the programming guide.
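If it helps to see the interleaving spelled out, here is a toy sketch that generates a stream like the one above. It is only an illustration: it assumes a fixed round-robin order across blocks, whereas the real hardware issues whichever resident warp is ready, and the function name is invented for this example.

```cpp
#include <string>
#include <vector>

// Toy model only: produces one "Block b warp w" entry per issue slot,
// rotating over the resident blocks first and then over each block's warps.
// The real scheduler does NOT follow a fixed order like this.
std::vector<std::string> toy_issue_order(int resident_blocks,
                                         int warps_per_block,
                                         int slots) {
    std::vector<std::string> order;
    if (resident_blocks <= 0 || warps_per_block <= 0) return order;
    for (int t = 0; t < slots; ++t) {
        int block = t % resident_blocks;                    // next block in rotation
        int warp  = (t / resident_blocks) % warps_per_block; // next warp of that block
        order.push_back("Block " + std::to_string(block) +
                        " warp " + std::to_string(warp));
    }
    return order;
}
```

The point is only that consecutive slots come from different blocks, so every resident block holds its registers and shared memory the whole time.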
OK, so this means that, assuming compute capability 1.1, 8192 registers and 16 KB of shared memory are shared by at most 8 blocks, so each block should use at most 1024 registers and 2 KB of shared memory. Did I get the point? :)
Not exactly, it is the other way around.
You have a kernel. That kernel uses x registers per thread and y bytes of shared memory per block. Then:
min(8, min(floor(8192 / (x * N_THREADS_PER_BLOCK)), floor(16384/y)))
gives you the number of blocks that run concurrently on a multiprocessor (the 24-warp limit mentioned above can reduce it further)
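The formula above can be sketched as a small host-side function. This is a rough estimate under stated assumptions: the limits are the ones quoted in this thread (8 blocks, 8192 registers, 16384 bytes of shared memory, 24 warps per multiprocessor), the warp size is taken as 32, and allocation granularity is ignored. The function name is hypothetical, not a CUDA API call.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits, as quoted in this thread.
constexpr int kMaxBlocks   = 8;
constexpr int kRegisters   = 8192;
constexpr int kSharedBytes = 16384;
constexpr int kMaxWarps    = 24;
constexpr int kWarp        = 32;  // threads per warp (assumption)

// Hypothetical helper: estimated number of concurrently active blocks
// for a kernel using x registers per thread and y bytes of shared
// memory per block. Returns 0 if a single block does not fit.
int active_blocks_per_mp(int regs_per_thread,         // x
                         int threads_per_block,       // N_THREADS_PER_BLOCK
                         int shared_bytes_per_block)  // y
{
    if (threads_per_block <= 0) return 0;
    int warps    = (threads_per_block + kWarp - 1) / kWarp;
    int by_warps = kMaxWarps / warps;
    // A resource the kernel does not use imposes no limit of its own.
    int by_regs  = regs_per_thread > 0
                       ? kRegisters / (regs_per_thread * threads_per_block)
                       : kMaxBlocks;
    int by_smem  = shared_bytes_per_block > 0
                       ? kSharedBytes / shared_bytes_per_block
                       : kMaxBlocks;
    return std::min({kMaxBlocks, by_warps, by_regs, by_smem});
}
```

For instance, 16 registers per thread, 128 threads per block and 2048 bytes of shared memory give 4 active blocks (registers are the limit), while 32 registers per thread with 256 threads per block leaves room for only 1 block.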
Thanks. So this means that up to 8 blocks can reside on one MP, but only IF there are enough resources for all of them. Otherwise the number of active blocks decreases, so in the worst case (the most resource-demanding kernel) there is only 1 active block per MP. Am I accurate now? :huh:
Yes, you are correct.
Alright. Thank you all for your time :)