CUDA Programming Guide:
“The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.”
Does that actually mean that one instruction is issued by the MP on every tick, and that this instruction is "applied" to all 32 threads of the warp (assuming no branching occurs within the warp)?
This leads to the thought that a minimum of 32 threads per block is required to achieve a speedup (in other words, running 32 threads in a block would take as much time as running 8 threads in a block, for instance).
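A quick way to check this claim is to time the same kernel launched with an 8-thread block versus a 32-thread block; if issue happens at warp granularity, the times should be close. The kernel body and the event-timing boilerplate below are my own illustrative sketch, not code from the guide:

```cuda
#include <cstdio>

// Trivial kernel: each thread runs a chain of dependent FMAs so there is
// real arithmetic work to time.
__global__ void busyKernel(float *out, int iters)
{
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + 0.5f;   // dependent ops keep the cores busy
    out[threadIdx.x] = v;
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One block of 8 threads vs one block of 32 threads. Both occupy a
    // single warp's issue slots, so if the claim holds, the elapsed
    // times should be roughly the same.
    for (int threads = 8; threads <= 32; threads *= 2) {
        cudaEventRecord(start);
        busyKernel<<<1, threads>>>(d_out, 1 << 20);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d threads: %.3f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```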
Another thread here discusses branch instructions. It was said:
Having 8 scalar cores in an MP and 32 threads in a warp, we can get 4 instructions issued on each scalar core (that is, one instruction per thread in the warp).
True or false?
Nevertheless, that is only the instruction issue effort, which can be neglected compared to the instruction execution effort. On the other hand, the execution effort IS instructions being issued. Now it's a mess. Can someone explain the process in detail, please?
Neither. It takes 4 clock cycles to issue an instruction to all 32 threads in a non-divergent warp: 8 cores x 4 cycles = 32 thread-instructions. So effectively, the MP instruction scheduler is issuing a new instruction at a maximum rate of one per four cycles. Not every instruction can be retired in one clock: double precision instructions on current hardware take a minimum of 8 cycles to retire (the double precision FPU is shared by all 8 scalar cores).
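The four-cycle issue can be pictured as the scheduler sending the same instruction to one quarter-warp of 8 threads per cycle. This is my own schematic of the accounting above, not an official NVIDIA diagram:

```cuda
// One instruction issued to a non-divergent warp on an 8-core
// multiprocessor (pre-Fermi). Each cycle, one quarter-warp (8 threads)
// starts the instruction on the 8 scalar cores:
//
//   cycle 0: threads  0..7   -> cores 0..7
//   cycle 1: threads  8..15  -> cores 0..7
//   cycle 2: threads 16..23  -> cores 0..7
//   cycle 3: threads 24..31  -> cores 0..7
//
// Total: 8 cores x 4 cycles = 32 thread-instructions, i.e. one warp
// instruction issued every 4 cycles. Double precision is slower still,
// because all 8 cores share one DP unit: 32 threads / 1 DP unit means
// at least 8x the single-precision issue-to-retire time.
```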
In Fermi, yes: there are now four times as many double precision capable FPUs per multiprocessor, so that theoretical 8 cycle minimum drops to two cycles on Fermi. The double precision retire rate is only the most obvious example. I am fairly sure that one of the NVIDIA people who post here (maybe Simon Green or Mark Harris, I can't remember off the top of my head) has hinted in the past that not all instructions execute in a single cycle. The programming guide notes that a full 32 bit integer multiply is a very costly instruction on current hardware, for example, which is why there is the __mul24() version, which does execute in a single cycle. It also says that this should change in future hardware.
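For reference, this is the kind of place __mul24() gets used. The intrinsic itself is real CUDA, but the kernel and names around it are my own illustration:

```cuda
// __mul24(a, b) multiplies the low 24 bits of a and b. On pre-Fermi
// hardware it maps to a fast native instruction, whereas a full 32-bit
// integer multiply is emulated and much more expensive.
__global__ void scaleIndices(int *out, int stride)
{
    // Safe here because block and thread indices fit comfortably
    // within 24 bits for any realistic launch configuration.
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    out[i] = __mul24(i, stride);
}
```

The usual caveat applies: __mul24() silently truncates operands to 24 bits, so it is only a win when you can guarantee the inputs fit.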