This way, 32000 blocks / 32 blocks per SM = 1000 SMs would be necessary to run all blocks concurrently. Considering all SMs are used in the Pascal GPU, 980 blocks would be stalled waiting to be processed, then 960, and so on.
I don’t quite understand this, since you seem to switch units from 1000 SMs to 1000 blocks.
Take your example of 32,000 blocks on a P100. Each SM can have 32 blocks running concurrently, so with 20 SMs you’ll have 640 blocks running at the same time (20 x 32), and it will take 50 passes (32,000 / 640) to process all 32,000 blocks.
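The pass count is just a ceiling division. Here is a minimal Python sketch of that arithmetic, assuming the figures quoted in this thread (20 SMs, 32 resident blocks per SM). The function name `passes_needed` is made up for illustration and is not part of any CUDA API.

```python
import math

def passes_needed(total_blocks, num_sms, max_blocks_per_sm):
    """Return (blocks running at once, passes needed to finish all blocks)."""
    # Every SM holds up to max_blocks_per_sm resident blocks at a time.
    concurrent_blocks = num_sms * max_blocks_per_sm
    # Round up: a partially filled last pass still counts as a pass.
    return concurrent_blocks, math.ceil(total_blocks / concurrent_blocks)

# Figures from this thread: 32,000 blocks, 20 SMs, 32 blocks per SM.
concurrent, passes = passes_needed(32_000, 20, 32)
print(concurrent, passes)  # 640 50
```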
But what does this have to do with the 32 threads per block, and how does it affect the application?
In addition to running 32 blocks per SM, each SM can run up to 2048 threads. But if you limit your blocks to only 32 threads, each SM runs just 32 blocks x 32 threads = 1024 threads, which is only half of the threads that could be run.
So let’s take your example of 32,000 blocks but think of it in terms of threads. At 32 threads per block, this means a total of 1,024,000 threads, with a max of 20,480 running concurrently (640 blocks x 32 threads), and 50 passes to complete. But with 20 SMs at 2048 threads each, a P100 is capable of running up to 40,960 concurrent threads. So to take advantage of all the available threads, you must use blocks with at least 64 threads.
Using a fixed number of 1,024,000 threads, setting your block size to 64 means only 16,000 blocks are needed, and only 25 passes to process them.
Keeping the number of blocks fixed at 32,000, but increasing the block size to 64 threads, means you can now run 2,048,000 threads. It will still take 50 passes, but you’ve doubled the number of threads that can work on the problem, either reducing the time it takes to run each block or allowing you to perform more work per block.
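To compare these scenarios side by side, here is a rough Python sketch. It assumes the limits quoted in this thread (20 SMs, 32 blocks per SM, 2048 threads per SM) and ignores register and shared memory limits, which can lower the real numbers. The function name `occupancy` is hypothetical.

```python
import math

def occupancy(total_blocks, threads_per_block, num_sms=20,
              max_blocks_per_sm=32, max_threads_per_sm=2048):
    """Estimate concurrency for a launch; defaults are this thread's figures."""
    # Resident blocks per SM are capped by both the block limit and the
    # thread limit (assumes threads_per_block <= max_threads_per_sm).
    blocks_per_sm = min(max_blocks_per_sm,
                        max_threads_per_sm // threads_per_block)
    concurrent_blocks = num_sms * blocks_per_sm
    return {
        "concurrent_threads": concurrent_blocks * threads_per_block,
        "total_threads": total_blocks * threads_per_block,
        "passes": math.ceil(total_blocks / concurrent_blocks),
    }

# (blocks, threads per block) for the three cases discussed in this thread.
for blocks, tpb in [(32_000, 32), (16_000, 64), (32_000, 64)]:
    print(blocks, tpb, occupancy(blocks, tpb))
```

With 32 threads per block the block limit is what holds you back. At 64 threads per block both limits are reached together, which is why 64 is the smallest block size that fills the SM under these assumptions.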
Note that this is very simplified and generalized. For specific questions about your actual code, please provide a reproducing example.
-Mat