Scheduling Thread Blocks

https://docs.nvidia.com/cuda/turing-tuning-guide/index.html#sm-occupancy

I’m using Quadro RTX 6000 (Turing).

In the picture above, it says the maximum number of concurrent warps per SM is 32. Does that mean the maximum number of threads per SM is 1024 (32 warps × 32 threads per warp)?

And

If the maximum number of thread blocks per SM is 16, must the sum of the threads across all 16 blocks on one SM be at most 1024, or does the 1,024-thread limit apply only to the single block currently running?

Yes, and this information is already available to you in the programming guide, in table 15.

It means that you can have a maximum of 16 threadblocks currently executing in an SM. It does not abrogate other hardware limits. You are also limited (on that device) to a maximum of 1024 threads per SM. Therefore, if you actually wanted to witness 16 threadblocks resident on a single SM, those threadblocks could have no more than 64 threads each. If you launched a kernel with 128 threads per block, you could witness at most 8 blocks on a single SM, even though the hardware limit for that value is 16.

This is the nature of occupancy. All relevant hardware limits must be satisfied.
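The arithmetic behind the answer above can be sketched in a few lines. This is a simplified illustration using only the two Turing (compute capability 7.5) limits discussed in this thread; a real occupancy calculation would also account for register and shared-memory limits, and the function name here is just for illustration:

```python
# Turing (cc 7.5) limits from the discussion above. Other limits
# (registers, shared memory) are deliberately ignored in this sketch.
MAX_BLOCKS_PER_SM = 16
MAX_THREADS_PER_SM = 1024

def max_resident_blocks(threads_per_block):
    """Upper bound on blocks resident on one SM, considering only
    the block-count limit and the thread-count limit. Every hardware
    limit must be satisfied simultaneously, so we take the minimum."""
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)

print(max_resident_blocks(64))   # 16 -- blocks of 64 threads can fill all 16 slots
print(max_resident_blocks(128))  # 8  -- the thread limit caps residency below 16
```

The same "take the minimum over all limits" pattern applies when the register or shared-memory footprint of a block becomes the binding constraint instead.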

Can blocks assigned to one SM be from different kernels?

Or can an SM only have blocks of the same kernel?

They can be from different kernels.

Then, the blocks execute as warps.

Can warps of different blocks (from different kernels) run on one SM at the same time?

Yes. This is all assuming those kernels are issued from the same application/process, or that MPS is being used.