The statement is: "To guarantee co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered. For example, a block per SM can be launched as follows:"
// initialize, then launch
cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);
I understand that this code launches deviceProp.multiProcessorCount thread blocks on the device, each containing numThreads threads. What I don't understand is why this guarantees exactly one thread block per SM, and rules out the case where multiple blocks are scheduled onto some SMs while the remaining SMs sit idle.
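For context, here is a fuller sketch of the launch pattern I am asking about. This is my own reconstruction, not the documentation's complete example: my_kernel, numThreads, and the occupancy check are assumptions I added. The occupancy query shows how many blocks of this kernel can be resident on one SM, which seems related to the co-residency guarantee.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel with no arguments, for illustration only.
__global__ void my_kernel() { /* grid-wide cooperation would go here */ }

int main() {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);

    int numThreads = 128;  // example block size (my assumption)

    // Query how many blocks of my_kernel, at this block size and
    // dynamic-shared-memory usage (0 here), can be resident per SM.
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, my_kernel, numThreads, 0);
    printf("blocks resident per SM: %d\n", maxBlocksPerSM);

    // Launch one block per SM, as in the quoted snippet. The cooperative
    // launch fails (cudaErrorCooperativeLaunchTooLarge) if the grid cannot
    // be co-resident, so an error check is worthwhile.
    void* args[] = { nullptr };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();
    return 0;
}
```

My question above is whether this launch configuration only guarantees that all blocks are resident simultaneously, or additionally guarantees a one-block-per-SM placement.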