I’m trying to analyze the architecture of the GTX 280 family and understand how it runs its computations. In the process some specific questions have arisen, and I hope the CUDA community can help me out.
BTW, I did read the CUDA Programming Guide and the GTX 200 Technical Brief, so no need to send me there. ;)
Since each SM contains 8 SP cores and is supposed to run 32 threads (a warp) at a time, I assume each SP must be pipelined with 4 threads at consecutive execution stages, correct? Hence, on every clock cycle a result for one thread out of four is produced, giving an effective computation rate of 1/4 of the clock rate from a single thread’s perspective - say, 300 MHz for a 1.2 GHz clock. (I intentionally omit memory read latencies here.)
I’ve read somewhere that if an SP core encounters a memory read latency, it is capable of switching to another thread - possibly abandoning that stalled thread for many (>4?) cycles, since memory read latencies can be quite large. Does that mean that SP cores are capable of out-of-order execution, or is that a stretch? If the stall indeed takes more than 4 cycles, are the ‘wasted’ cycles (one out of four) assigned to another thread, or are they filled with no-ops?
Is there any information on the pipeline contents and length of an SM - or of some of its parts, like the IU or SP?
For the purpose of branch-heavy GPGPU, is it fair to say that a GTX 280 is in fact a 30-core processor (as opposed to 240), since that’s how many perfectly independent execution paths it is able to process in parallel?
How about the resources listed in the FAQ? They have some good information, too.
* J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
<a target='_blank' rel='noopener noreferrer' href='http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532'>http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532</a>
* E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, Mar./Apr. 2008, pp. 39-55.
It’s very fuzzily defined. Nobody really knows everything except for people at NVIDIA and they generally keep their mouths shut.
I can’t really give you anything more than a few additional tidbits mentioned by NVIDIA people on the forums, if you dig hard enough:
(1) The warp size of 32 is set somewhere in software/driver/hardware microcode. The hardware is actually capable of a warp size of 16, but 32 was chosen so that if future hardware did move to a native width of 32, developers wouldn’t see a change.
One consequence of (1) is that the clock for the SPs only needs to be twice the clock of the instruction decoders.
Not only can you get interleaved execution among different threads on an SP; a single thread can continue executing past a memory request until the register that the memory request writes to is actually used. Only then will that thread go into a wait state until the memory request completes. Whether or not you define this as out-of-order is a matter of semantics.
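A hypothetical kernel makes the behavior described above concrete (the kernel and its names are mine, just for illustration):

```cuda
// Sketch: the load into 'v' is issued, but the thread keeps executing the
// independent arithmetic below it while the load is in flight. The thread
// only stalls at the first instruction that actually *uses* v.
__global__ void latency_hiding(const float *in, float *out, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = in[i];        // global load issued; no stall here

    float acc = a * b;      // independent work, overlaps with the load
    acc = acc * acc + a;    // still independent of v

    out[i] = acc + v;       // first use of v: the thread waits here if the
                            // load has not completed yet
}
```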
The only other “out-of-order” type behavior I remember being mentioned is the ability to run the special function unit along with the normal unit to get a MADD+MUL at the same time.
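As a sketch of that pairing (whether the compiler/hardware actually dual-issues any given pair is not something we can control from CUDA C):

```cuda
// Hypothetical example: the multiply-add can execute on the SP units while
// the extra multiply goes to the special function unit, so both operations
// may retire in the same issue slot.
__global__ void mad_plus_mul(float *out, float a, float b, float c, float d)
{
    float mad = a * b + c;   // MAD on the SP
    float mul = a * d;       // MUL that may dual-issue on the SFU
    out[threadIdx.x] = mad + mul;
}
```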
I would say it is a matter of opinion. My opinion would be no, for two reasons:

1) Even branch-heavy code can be written to take advantage of parallel memory loads. With memory read coalescing, 240 SPs reading data is A LOT faster than 30 would be.

2) The vast majority of the CUDA apps out there are limited by exactly the memory bandwidth and latencies you want to ignore. For a memory-bandwidth-limited app, one can add a massive amount of branches “for free,” as they can all be processed long before the next memory reads come in.
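Point 1) can be sketched as follows (a made-up kernel, not from any particular app): the branch diverges per thread, but because consecutive threads still read consecutive addresses, each half-warp’s load coalesces into a single wide memory transaction.

```cuda
// Branch-heavy but coalesced: thread k reads in[base + k], so the loads
// coalesce regardless of which way each thread's branch goes. The branch
// cost is small next to the memory latency it overlaps with.
__global__ void branchy_but_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];          // coalesced load across the half-warp

    if (v > 0.0f)             // divergent branch, serialized per warp
        out[i] = v * 2.0f;
    else
        out[i] = -v;
}
```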