In G80, the warp size is 32 and a half-warp is the basic single-instruction, multiple-data (SIMD) unit, which means that every 16 threads execute in lock step. My guess is that there are 8 streaming processors (SPs) in every SM, and each one is hyper-threaded by a factor of 2, so 8*2 = 16. The question is: does the size of the SIMD unit depend on the hardware or on the CUDA compute capability? One of my colleagues said that it depends on software, i.e. the CUDA version. I am not sure if he is right.
That’s not true; half-warps are only relevant to certain kinds of memory accesses, not to thread execution. The SIMD vector width is effectively 32.
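If you want to check this on your own hardware rather than take anyone’s word for it, here is a minimal sketch (assuming the standard CUDA runtime API) that queries the warp size the device reports:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0; warpSize reports the warp width in threads (32 on G80-class hardware).
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device: %s, warp size: %d\n", prop.name, prop.warpSize);
    return 0;
}

Compile it with nvcc; it should print 32 on any current part.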
SPs are not hyper-threaded; MPs are. When an MP issues an instruction to a warp, each of its eight SPs executes that instruction sequentially 4 times so that all 32 threads are covered. When a warp stalls (due to a high-latency memory access, for example), the MP has many other warps ready and switches to executing a non-stalled one (sort of like hyper-threading).
You might ask why the warp size is not 8, then. I’m not sure about that, I didn’t design the hardware :) but I believe the MP’s instruction-issue hardware would have to be very fast in that case. In current implementations, issuing an instruction to the SPs takes 4 clock cycles, so if we don’t want the scheduler to be the bottleneck, the SPs should work on that instruction for at least 4 cycles, meaning 4 sequential executions (if the instruction takes 1 SP cycle to execute).
It could be; that is why NVIDIA keeps saying not to make assumptions about the warp size being 32 in our algorithms. But all current and announced GPUs (including Fermi) have a warp size of 32, regardless of compute capability or CUDA version. There is no combination of hardware and software that gives a different warp size. This may change in the future.
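Along those lines, here is a small sketch of how to avoid hard-coding 32: device code has a built-in warpSize variable, so a kernel can compute its warp and lane indices from it. The kernel and array names (tag_lanes, warp_id, lane_id) are just made up for illustration:

__global__ void tag_lanes(int *warp_id, int *lane_id, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        warp_id[tid] = threadIdx.x / warpSize;  // which warp within the block
        lane_id[tid] = threadIdx.x % warpSize;  // lane (position) within that warp
    }
}

One caveat: warpSize is not a compile-time constant, so the divide/modulo above won’t be reduced to shifts and masks the way a literal 32 would be; whether that trade-off is worth it depends on how much you trust the warp size to stay 32.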
But it processes two warps simultaneously per SM (MP), so it’s 16 SPs per warp.