I know that 1 SM has 8 SPs and that each SP processes 1 thread.
But the warp size is 32, not 8.
How can 32 threads be processed in parallel?
Each SP is pipelined: most math ops complete at a rate of one per clock, but with 4 clocks of latency.
The warp scheduler also issues a new instruction every 4 clocks.
So the 32 threads of a warp are effectively processed in 4 passes of 8 threads, one clock each.
But most of this is abstracted away from you; in your head you can just think of all 32 threads of the warp stepping forward simultaneously.
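To make that mental model concrete, here is a minimal CUDA sketch (a hypothetical kernel of my own, not from this thread): you write the kernel as if all 32 threads of a warp execute each instruction together, even though an 8-SP multiprocessor physically issues it over 4 clocks.

#include <cuda_runtime.h>

// Illustrative kernel: every thread scales one element. Threads are grouped
// into warps of 32 consecutive threadIdx.x values; within a warp you can
// reason as if all 32 lanes step through this code in lockstep, even though
// the 8 SPs of a multiprocessor process the warp in 4 passes (32 / 8 = 4).
__global__ void scale(float *data, float factor, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // 256 threads per block = 8 warps per block.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}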
Thank you for your answer. :rolleyes:
And how can I get this information from somewhere other than this forum? Is there a white paper about the architecture, e.g. about the warp scheduler?
All the info needed to program is in the programming guide.
If you want deeper knowledge about the hardware and how the software maps onto it (and again, this is not needed for programming; it is all nicely abstracted away), papers like “Scalable Parallel Programming with CUDA” are nice to read.
Specifically, iara, there are references listed in the FAQ: did you read it?
Doesn’t that mean that at the end of the 4 cycles, the 4 groups of 8 threads are not really done yet (they are in the 1st, 2nd, 3rd and 4th pipeline stages, respectively)?
Yes, and the pipeline is even deeper than that: 6 warps, or 192 threads, are needed per multiprocessor to hide the full pipeline depth. It is a shame the forum's search function isn't the best, but you can find the relevant threads via Google; just search for: 6 warps site:http://forums.nvidia.com
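As a back-of-the-envelope check of that figure (my own sketch, assuming only the numbers given in this thread: 32 threads per warp and 6 warps to cover the pipeline depth):

#include <cstdio>

int main(void)
{
    const int warpSize      = 32;  // threads per warp
    const int warpsToHide   = 6;   // warps needed to cover the pipeline depth (from the post above)
    const int threadsNeeded = warpsToHide * warpSize;

    // Prints 192: with at least this many resident threads per multiprocessor,
    // the scheduler always has another ready warp to issue while earlier warps
    // are still in flight in the pipeline.
    printf("threads per multiprocessor needed: %d\n", threadsNeeded);
    return 0;
}

In practice this is why block sizes of 192 or 256 threads are a common starting point, as long as at least one block is resident per multiprocessor.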