threads in one block

If the number of threads in one block is less than 32 (the warp size), how are they executed? How does the SIMT unit handle this situation? Thanks.

It launches one warp with the unused threads masked off.

I still don’t quite understand. For example, if a block has 8 threads on an SM with 8 scalar processors, how many clocks does the SIMT unit need to issue one instruction to all threads in that block? 1 clock or 4 clocks? Is the warp size 8 or 32? Thanks.

The size of a warp is 32, and one instruction for a warp finishes in 4 clocks. Since that is the unit of scheduling, if you only use 8 threads in a block, they are put into a warp of 32 (with a bunch of empty slots masked) and still take 4 clocks to complete one instruction.
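
To make the arithmetic concrete, here is a rough host-side sketch in plain C++ (not device code). It assumes the figures quoted in this thread, a warp size of 32 and 8 scalar processors per SM, and the simple "one warp instruction takes warp size / processors clocks" model. The function names are made up for illustration.

```cpp
// Assumed figures from this thread; they vary between GPU generations.
constexpr int kWarpSize = 32;
constexpr int kScalarProcessorsPerSM = 8;

// Warps needed to hold a block. A partly filled warp still occupies a whole warp.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes that exist in the block's warps but have no thread (masked off).
constexpr int masked_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// Clocks for one instruction of one warp under the simple model above.
// This does not depend on how many lanes in the warp are active.
constexpr int clocks_per_warp_instruction() {
    return kWarpSize / kScalarProcessorsPerSM;
}

// Compile-time checks for the 8-thread block discussed above.
static_assert(warps_per_block(8) == 1, "8 threads fit in one warp");
static_assert(masked_lanes(8) == 24, "24 of the 32 lanes are masked");
static_assert(clocks_per_warp_instruction() == 4, "32 lanes over 8 processors");
```

So under these assumptions a block of 8 threads costs the same 4 clocks per instruction as a block of 32 threads, and a block of 40 threads needs two warps.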

Thanks, your explanation is very clear.

I think the basic execution unit in a block is not a warp but a half-warp (16 threads).

The basic execution unit in a block is a warp, because there are 8 cores in an SM and each handles 4 threads (8 × 4 = 32).
Half-warps are used for memory transactions.
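
A small host-side C++ sketch of how a thread index splits into warp, lane, and half-warp may help here. It assumes a one-dimensional block, a warp size of 32, and 8 cores per SM as stated in this thread. The `ThreadSlot` struct and `locate` function are hypothetical names, and the sketch says nothing about which core actually runs which lane.

```cpp
// Assumed figures from this thread; they vary between GPU generations.
constexpr int kWarpSize = 32;
constexpr int kHalfWarpSize = 16;
constexpr int kCoresPerSM = 8;

// Where a thread of a 1D block sits (hypothetical helper type).
struct ThreadSlot {
    int warp;       // which warp of the block
    int lane;       // position inside that warp, 0..31
    int half_warp;  // 0 for lanes 0..15, 1 for lanes 16..31
};

// Split a linear thread index into warp, lane, and half-warp.
constexpr ThreadSlot locate(int tid) {
    return ThreadSlot{tid / kWarpSize,
                      tid % kWarpSize,
                      (tid % kWarpSize) / kHalfWarpSize};
}

// Threads of one warp that each core has to handle: 32 / 8.
constexpr int threads_per_core_per_warp() {
    return kWarpSize / kCoresPerSM;
}

// Thread 40 is lane 8 of warp 1, in the first half-warp.
static_assert(locate(40).warp == 1, "warp index");
static_assert(locate(40).lane == 8, "lane index");
static_assert(locate(40).half_warp == 0, "half-warp index");
static_assert(threads_per_core_per_warp() == 4, "4 threads per core");
```

The warp is the unit the scheduler issues instructions for, while the half-warp index only matters for how memory accesses are grouped.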

Hi, any references for “each handles 4 threads”?