About the SM (streaming multiprocessor) processing the warps in a block

The current GPU is a Tesla K80. Where is the detailed documentation that tells me the maximum number of blocks that can be scheduled on one SM (streaming multiprocessor)? I only know that the maximum number of threads resident on one SM is 2048, and the maximum number of threads per block is 1024.
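For reference, most of these per-device limits can be queried at runtime. Below is a minimal sketch, assuming device 0 (a K80 shows up as two CUDA devices) and CUDA 11 or newer for the per-SM block field; for older toolkits the same limit is tabulated in the CUDA C Programming Guide's compute-capability table (16 resident blocks per SM for compute capability 3.x, which covers the K80).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; a K80 appears as two CUDA devices
    printf("SM count                  : %d\n", prop.multiProcessorCount);
    printf("Max threads per SM        : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block     : %d\n", prop.maxThreadsPerBlock);
    printf("Warp size                 : %d\n", prop.warpSize);
#if CUDART_VERSION >= 11000
    // This field only exists in CUDA 11+; older toolkits must use the
    // programming guide's compute-capability table instead.
    printf("Max resident blocks per SM: %d\n", prop.maxBlocksPerMultiProcessor);
#endif
    return 0;
}
```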
Background: converting a two-layer loop into a GPU parallel version; the inner and outer loops each have 320 iterations.
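For context, here is a minimal sketch (with hypothetical names loop2d, d_in, d_out, and a placeholder loop body) of how a 320 x 320 double loop maps onto a 2D grid; both configurations discussed below cover the same index space.

```cpp
#include <cuda_runtime.h>

#define N 320

// Each thread takes one (i, j) pair of the original double loop.
__global__ void loop2d(float* out, const float* in) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // outer loop index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // inner loop index
    if (i < N && j < N)
        out[i * N + j] = 2.0f * in[i * N + j];       // placeholder for the real loop body
}

// Launch examples for the two configurations (d_in/d_out are device buffers):
//   loop2d<<<dim3(20, 20), dim3(16, 16)>>>(d_out, d_in);   // 400 blocks x 256 threads
//   loop2d<<<dim3(40, 40), dim3( 8,  8)>>>(d_out, d_in);   // 1600 blocks x 64 threads
```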
Question 1:
I set the grid dimensions to (20, 20, 1) and the threads per block to (16, 16, 1). The Tesla K80 has a total of 2 x 13 = 26 SMs. With 32 threads per warp, each block can be divided into 8 warps, so the whole grid contains 20 x 20 x 8 = 3200 warps, but I only have 26 SMs. Do the 26 SMs start with 1 warp from each of 26 blocks (1 warp per block), or with 3 full blocks (24 warps) plus 2 warps from a fourth block? Knowing that an SM handles warps one by one, how does it achieve so much thread parallelism?
Question 2:
For the same 320 x 320 two-layer loop I compared two configurations: blocks (20, 20, 1) with threads (16, 16, 1), and blocks (40, 40, 1) with threads (8, 8, 1).
The first case has a total of 400 blocks, each of which can be divided into 8 warps; the second case has a total of 1600 blocks, each of which can be divided into 2 warps.
Every warp is a full 32 threads and the total number of warps is the same, yet the measured times are different: the second case is faster. Why is this? What causes it?
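To make the comparison concrete, here is a minimal timing sketch using CUDA events, assuming the hypothetical loop2d kernel and d_in/d_out buffers from the sketch above are defined in the same file.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void loop2d(float* out, const float* in);   // the kernel sketched above

// Returns the elapsed kernel time in milliseconds for one launch configuration.
float timeLaunch(dim3 grid, dim3 block, float* d_out, const float* d_in) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    loop2d<<<grid, block>>>(d_out, d_in);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage:
//   float t1 = timeLaunch(dim3(20, 20), dim3(16, 16), d_out, d_in);  // 400 blocks x 8 warps
//   float t2 = timeLaunch(dim3(40, 40), dim3( 8,  8), d_out, d_in);  // 1600 blocks x 2 warps
```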

Knowing that an SM handles warps one by one, how does it achieve so much thread parallelism?

Maybe I’m not following the question, but a K80 can handle up to 64 warps (2048 threads) per SM (depending on occupancy).
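As a hedged sketch, the occupancy API can report how many blocks of each size can be resident on one SM at once; the loop2d kernel below is just a stand-in for the real one, since occupancy also depends on the actual kernel's register and shared-memory use.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real kernel; results for your own code may differ.
__global__ void loop2d(float* out, const float* in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}

int main() {
    int blocks256 = 0, blocks64 = 0;
    // 16 x 16 = 256 threads (8 warps) per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks256, loop2d, 256, 0);
    // 8 x 8 = 64 threads (2 warps) per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks64, loop2d, 64, 0);

    printf("Resident blocks per SM at 256 threads/block: %d (%d warps)\n",
           blocks256, blocks256 * 8);
    printf("Resident blocks per SM at  64 threads/block: %d (%d warps)\n",
           blocks64, blocks64 * 2);
    return 0;
}
```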

Every warp is a full 32 threads and the total number of warps is the same, yet the measured times are different: the second case is faster. Why is this? What causes it?

This would be highly dependent on the loop, so it's not something that can be answered generally. I suggest you try using the Nsight Compute profiler and compare the performance between the two cases. It should give you a better idea as to what’s going on.

-Mat