I have a question regarding warps in CUDA.
My understanding is that the GPU device has a number of multiprocessors (say N), and each multiprocessor contains several processors (say M).
So, if we load a kernel onto the device, it is executed as a grid of thread blocks. These blocks are scheduled for execution on the multiprocessors.
Each active block is then split into groups of threads called warps, which are executed simultaneously.
Since a single multiprocessor has only M processors, shouldn't the warp size always be M?
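For concreteness, here is a minimal sketch (assuming a standard CUDA toolkit and device 0) of how I'm querying these values with cudaGetDeviceProperties; as far as I can tell, the runtime reports N (multiProcessorCount) and the warp size directly, but not M:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query properties of device 0 (assumption: at least one CUDA device)
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // N: number of multiprocessors on the device
    printf("Multiprocessors (N): %d\n", prop.multiProcessorCount);
    // Warp size as reported by the runtime
    printf("Warp size:           %d\n", prop.warpSize);
    // For comparison: the block-size limit, which is separate from warp size
    printf("Max threads/block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```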
Can anyone please clarify this?