(I'm on compute capability 1.3.)
I know that there are 8 processors per multiprocessor and that each processor processes an instruction every 4 cycles. (Is that right?)
So I think 32 threads (that is, one warp) can run in parallel.
It is said that the physical limit on the number of active threads per multiprocessor is 1024. (Equivalently, the limit on the number of active warps is 32.)
Then how can a multiprocessor process 32 warps at once?
By interleaving execution. The 32 warps are resident on the multiprocessor at the same time, but they do not all execute in the same cycle: while some warps are stalled waiting for data from memory, others can execute. The programming guide covers this fairly well in its introductory chapters.
Thank you for your help.