CUDA Programming Guide:
“The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.”
Does that actually mean that the multiprocessor issues one instruction per clock cycle, and that this instruction is applied to all 32 threads of the warp (assuming no branching occurs within the warp)?
This leads me to think that a minimum of 32 threads per block is required to achieve any speedup (in other words, running a block of 32 threads should take as much time as running a block of 8 threads, for instance).
True or false?
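One way to check this empirically would be a timing sketch like the one below (hedged: this assumes a CUDA-capable device, and the kernel name `busy`, the helper `timeLaunch`, and the iteration count are illustrative choices, not anything from the guide). The idea is to launch the same number of blocks twice, once with 8 threads per block and once with 32, and compare elapsed times with CUDA events; if a partially filled warp still occupies a full warp issue slot, both launches should take roughly the same time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial compute-bound kernel: each thread does enough arithmetic
// that the measurement is not dominated by launch overhead.
__global__ void busy(float *out, int iters) {
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Times one launch of `blocks` x `threadsPerBlock` using CUDA events
// and returns the elapsed time in milliseconds.
static float timeLaunch(float *out, int blocks, int threadsPerBlock) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    busy<<<blocks, threadsPerBlock>>>(out, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    float *out;
    cudaMalloc(&out, 1024 * 32 * sizeof(float));
    // Same number of blocks in both runs; only the block size differs.
    printf(" 8 threads/block: %.2f ms\n", timeLaunch(out, 1024, 8));
    printf("32 threads/block: %.2f ms\n", timeLaunch(out, 1024, 32));
    cudaFree(out);
    return 0;
}
```

If the two timings come out close, that would support the "8 threads cost as much as 32" hypothesis, since the 8-thread blocks still each consume a full warp's issue slots with 24 lanes masked off.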