I have just read the PTX ISA documentation. It says that the warp size may be 16 or 32. I am wondering which factors affect the warp size. I am currently using a GF 210, which has 2 multiprocessors with 8 processors per multiprocessor. So if the warp size equals 32, how can the GF 210 execute the threads of a warp in SIMD fashion??
So far all CUDA devices have had a warp size of 32.
To execute 32 threads on 8 processors, each processor executes 4 threads - that’s why most instructions have a throughput of one instruction every four clock cycles.
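That arithmetic can be sketched in a few lines (a minimal illustration, using only the warp size and processor count from this thread):

```python
# Throughput arithmetic for a compute-capability-1.x multiprocessor,
# using the figures from the posts above.
WARP_SIZE = 32   # threads per warp
SP_PER_SM = 8    # scalar processors per multiprocessor

# Each processor handles 4 of the warp's 32 threads, one per clock,
# so one warp instruction takes 4 clock cycles to issue.
cycles_per_warp_instruction = WARP_SIZE // SP_PER_SM
print(cycles_per_warp_instruction)  # -> 4
```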
Do you mean one clock per instruction per thread, so that with 4 threads per processor, 4 clocks are needed???
At least as far as throughput is concerned. It’s a bit more complicated if you go into the details. Latency is higher, as the processors are highly pipelined. But as latency is fully absorbed by scheduling different warps (as long as you have at least six warps running per SM), it is sufficient to look at the throughput numbers only.
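To see roughly where the "at least six warps" figure comes from, assume an arithmetic pipeline latency of about 24 clock cycles (an assumed, commonly cited figure for compute-capability-1.x devices; the exact value is not stated in this thread). With one warp instruction issued every 4 cycles, the scheduler needs enough other warps to keep the pipeline busy:

```python
# Hedged sketch: the 24-cycle pipeline latency is an assumption,
# a commonly cited figure for compute-capability-1.x devices.
PIPELINE_LATENCY_CYCLES = 24
CYCLES_PER_WARP_INSTRUCTION = 4  # from the 32-thread / 8-processor arithmetic

# While one warp's instruction works its way through the pipeline,
# other warps can issue; this many resident warps fully hide the latency.
warps_to_hide_latency = PIPELINE_LATENCY_CYCLES // CYCLES_PER_WARP_INSTRUCTION
print(warps_to_hide_latency)  # -> 6
```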