I read that it is good to have at least 2 thread groups per multiprocessor so that if one stalls, the multiprocessor can work on another to hide latency. However, doesn't the multiprocessor work on one warp at a time? So if you had one thread group with many warps, isn't that enough to hide latency, since the multiprocessor can switch to a different warp if one stalls? Or is the idea to have at least 2 thread groups per multiprocessor to ensure there are enough warps for the multiprocessor to work on?
During a __syncthreads(), the number of active warps in a block decreases until only a single warp is left.
Similarly, memory stalls will likely happen at the same time for all warps in a block.
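To put rough numbers on the __syncthreads() point, here is a toy model in plain host-side C++ (it would also compile inside a CUDA source file). It is only a counting sketch, not a measurement: the helper names `runnable_warps` and `worst_case_runnable` are made up for illustration, and it assumes the other resident blocks are not sitting at a barrier at the same moment.

```cpp
// Toy model: counts warps that can still be issued on one multiprocessor
// while ONE resident block drains into a __syncthreads() barrier.
// Assumption (not from any API): all other resident blocks are away from
// their barriers at that moment, so all of their warps are still runnable.

// Hypothetical helper: runnable warps when `waiting` warps of the stalling
// block are already parked at the barrier.
int runnable_warps(int blocks_per_sm, int warps_per_block, int waiting) {
    int in_stalling_block = warps_per_block - waiting;             // not yet at the barrier
    int in_other_blocks = (blocks_per_sm - 1) * warps_per_block;   // unaffected blocks
    return in_stalling_block + in_other_blocks;
}

// Worst case just before the barrier releases: every warp of the stalling
// block except the last one to arrive is waiting.
int worst_case_runnable(int blocks_per_sm, int warps_per_block) {
    return runnable_warps(blocks_per_sm, warps_per_block, warps_per_block - 1);
}
```

With the same 8 resident warps in total, `worst_case_runnable(1, 8)` gives 1 while `worst_case_runnable(2, 4)` gives 5. Under this model a second resident block keeps the scheduler supplied with warps while the first one waits at its barrier.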