About the SM multiprocessor processing the warps in a block

My GPU is a Tesla K80. Where is the detailed documentation that tells me the maximum number of blocks that can be resident on one SM (streaming multiprocessor)? I only know that the maximum number of threads per SM is 2048 and the maximum number of threads per block is 1024.
Background: I am converting a two-layer loop into GPU-parallel code; both the inner and outer loops have 320 iterations.
Question 1:
I set the grid to (20, 20, 1) blocks with (16, 16, 1) threads per block. A Tesla K80 has 2 x 13 = 26 SMs in total. With 32 threads per warp, each block divides into 8 warps, so the whole grid contains 20 x 20 x 8 = 3200 warps. But I only have 26 SMs, so do the 26 SMs each start with 1 warp from one of 26 blocks (1 warp/1 block), or with 3 blocks (24 warps) + 1 block (2 warps)? Given that an SM processes warps one at a time, how does it achieve so much thread parallelism?
Question 2:
For the same 320 x 320 two-layer loop, compare (20, 20, 1) blocks with (16, 16, 1) threads per block against (40, 40, 1) blocks with (8, 8, 1) threads per block.
The first case has 400 blocks in total, each dividing into 8 warps; the second case has 1600 blocks, each dividing into 2 warps.
Every warp is a full 32 threads and the total number of warps is the same, yet the measured times differ: the second case is faster. Why is this, and what causes it? A minimal sketch of how I map the loop onto the two configurations follows.
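(The kernel below is a hypothetical stand-in for my real code, just to make the mapping concrete.)

```
#include <cuda_runtime.h>

// Each thread handles one (i, j) iteration of the 320x320 loop nest.
__global__ void loop2d(float *out, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // inner loop index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // outer loop index
    if (i < n && j < n)
        out[i * n + j] = (float)(i + j);  // placeholder for the real work
}

// Case 1: 20x20 blocks of 16x16 threads -> 256 threads = 8 warps per block
// loop2d<<<dim3(20, 20), dim3(16, 16)>>>(d_out, 320);
// Case 2: 40x40 blocks of 8x8 threads   -> 64 threads  = 2 warps per block
// loop2d<<<dim3(40, 40), dim3(8, 8)>>>(d_out, 320);
```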

Hi M_yeah,

Where is the detailed documentation that tells me the maximum number of blocks that can be resident on one SM (streaming multiprocessor)?

There is a maximum of 16 blocks per SM on a K80 (cc 3.7). See the table on page 7: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
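Rather than memorizing the table, you can also query the device at runtime; the occupancy API (CUDA 6.5 and later) reports how many blocks of a given kernel actually fit on one SM. A quick sketch, using a dummy kernel as a stand-in for yours:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}  // stand-in for your kernel

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, max threads/SM: %d, max threads/block: %d\n",
           prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerBlock);

    // How many 256-thread blocks of 'dummy' can be resident on one SM?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, 256, 0);
    printf("resident blocks/SM at 256 threads/block: %d\n", blocksPerSM);
    return 0;
}
```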


how does it achieve so much thread parallelism?

There are 192 single-precision and 64 double-precision cores per SM. Each core is pipelined, so multiple threads per core are in flight at once. Pages 8-10 of the whitepaper above give more details.

Every warp is a full 32 threads and the total number of warps is the same, yet the measured times differ: the second case is faster. Why is this, and what causes it?

There are many factors that may help or hurt performance.

For occupancy, register usage is also a factor. In the second case the maximum occupancy is 50% (the 16-block limit x 64 threads per block = 1024 of 2048 threads), but that leaves more registers available per thread: GK210's 128K-register file spread across 1024 threads allows up to 128 registers per thread. In the first case, reaching 100% occupancy (8 blocks x 256 threads = 2048 threads) leaves at most 64 registers per thread, so if the kernel uses more than 64 registers per thread it cannot achieve 100% occupancy.
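If registers do turn out to be the limiter, one experiment (a sketch, not necessarily the right fix for your kernel) is to cap register usage with __launch_bounds__ or the -maxrregcount compiler flag, then re-measure:

```
// Hint to the compiler: this kernel launches with at most 256 threads per
// block and we want at least 8 blocks resident per SM (8 x 256 = 2048
// threads, i.e. 100% occupancy on a K80). The compiler may spill registers
// to local memory to meet this, which can be slower -- always time it.
__global__ void __launch_bounds__(256, 8)
mykernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= 2.0f;  // placeholder body
}

// Alternatively, compile with: nvcc -maxrregcount=64 ...
// Check actual per-thread register usage with: nvcc -Xptxas -v ...
```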

Also, high occupancy does not necessarily lead to better performance. Other factors, such as memory layout, stride-1 data access, caching, precision, branch divergence, and instruction-level parallelism, can affect performance.
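To illustrate just the stride-1 point (hypothetical kernels, not your code): when the 32 threads of a warp read consecutive addresses, the loads coalesce into a few wide memory transactions; with strided access, one warp's loads span many separate segments and waste bandwidth.

```
// Coalesced: thread k of a warp reads element k -- consecutive addresses.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Strided: neighboring threads read addresses 'stride' elements apart,
// so a single warp's loads touch many memory segments.
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];
}
```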

To understand the differences, I would suggest using nvprof/pgprof to perform a detailed analysis of each case with metrics enabled (for example, nvprof --metrics achieved_occupancy,gld_efficiency ./a.out).

Hope this helps,
Mat