Need help understanding the GT200 computing model: some tough questions ;)

I’m trying to analyze the architecture of the GTX 280 family and understand how it runs its computations. In the process some specific questions have arisen and I hope the CUDA community could help me out.

BTW, I did read the CUDA Programming Guide and the GTX 200 Technical Brief, so no need to send me there. ;)

  1. Since each SM contains 8 SP cores and it’s supposed to run 32 threads (a warp) at a time, I assume each SP must be pipelined with 4 threads at consecutive execution stages, correct? Hence, on every clock cycle a result is produced for one thread out of four, giving an effective computation rate of 1/4 of the clock rate from the thread’s perspective - say, 300 MHz for a 1.2 GHz clock. (I intentionally omit memory read latencies here.)

  2. I’ve read somewhere that if an SP core encounters a memory read latency, it is capable of switching to another thread - possibly abandoning the stalled thread for many (>4?) cycles, since memory read latencies can be quite large. Does that mean that SP cores are capable of out-of-order execution, or is that a stretch? If the stall indeed takes more than 4 cycles, are the ‘wasted’ cycles (one out of four) assigned to another thread, or are they filled with no-ops?

  3. Is there any information on the pipeline contents and length of an SM - or of some of its parts, like the IU or SP?

  4. For the purpose of branch-heavy GPGPU, is it fair to say that a GTX 280 is in fact a 30-core processor (as opposed to 240), since that’s how many perfectly independent execution paths it is able to process in parallel?
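For what it’s worth, the arithmetic behind question 1 can be sketched as a toy model. The figures (8 SPs per SM, 32-thread warp, 1.2 GHz shader clock) are the ones assumed in the question, not an official NVIDIA description:

```python
# Toy per-thread throughput model for one GT200 SM (assumed figures from
# the question above; ignores memory latency entirely).
SP_PER_SM = 8            # scalar cores per SM
WARP_SIZE = 32           # threads per warp
SHADER_CLOCK_HZ = 1.2e9  # example shader clock

# A warp of 32 threads on 8 SPs takes 32 / 8 = 4 shader clocks to issue
# one instruction for every thread.
clocks_per_warp_instruction = WARP_SIZE // SP_PER_SM

# From a single thread's perspective, one of its instructions completes
# every 4 clocks, i.e. an effective rate of clock / 4.
effective_rate_hz = SHADER_CLOCK_HZ / clocks_per_warp_instruction

print(clocks_per_warp_instruction)  # 4
print(effective_rate_hz / 1e6)      # 300.0 (MHz)
```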

I quite like this article, which gives a nice description of GPU architecture:

http://graphics.stanford.edu/~kayvonf/pape…ahalianCACM.pdf

How about the resources listed in the FAQ? They have some good information too.

*  J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53

  http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532

*  E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, Mar./Apr. 2008, pp. 39-55

It’s very fuzzily defined. Nobody really knows everything except the people at NVIDIA, and they generally keep their mouths shut.

I can’t really give you anything more than a few additional tidbits which can be found mentioned by NVIDIA people on the forums if you dig hard enough:

  1. The warp size of 32 is set in software/driver/hardware microcode or something. The hardware is actually capable of a warp size of 16, but 32 was chosen so that if future hardware did go to a native 32-wide warp, developers wouldn’t see a change.

  2. A consequence of (1) is that the SP clock only needs to run at twice the instruction decoder clock: a hardware warp of 16 keeps the 8 SPs busy for two fast clocks per issued instruction.

Not only can you get interleaved execution among different threads on an SP, a single thread can continue execution past a memory request until the register that memory request writes to is actually used. Only then will that thread go into a wait state and wait for the memory request to complete. Whether or not you define this as out-of-order is a matter of semantics.

The only other “out-of-order” type behavior I remember being mentioned is the ability to run the special function unit alongside the normal SP units to get a MAD+MUL at the same time.

I would say it is a matter of opinion. My opinion would be no, for two reasons: 1) Even branch-heavy code can be written to take advantage of parallel memory loads; with memory read coalescing, 240 SPs reading data is A LOT faster than 30 would be. 2) The vast majority of CUDA apps out there are limited by exactly the memory bandwidth and latencies you want to ignore. In a memory-bandwidth-limited app, one can add a massive number of branches “for free”, since they can all be processed long before the next memory reads come in.
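The “branches for free” point can be made concrete with a crude overlap model (all numbers are hypothetical, just to show the shape of the argument): in a memory-bound kernel, compute time including branch overhead overlaps with outstanding memory traffic, so the total is roughly the maximum of the two, not the sum.

```python
# Crude overlap model for a memory-bound kernel. All clock counts are
# made up for illustration; this is not a real GT200 performance model.
def kernel_time(mem_clocks, compute_clocks, branch_clocks):
    # Compute (including divergence overhead) hides under memory traffic,
    # so only the excess beyond the memory time is visible.
    return max(mem_clocks, compute_clocks + branch_clocks)

baseline = kernel_time(1000, 200, 0)
branchy  = kernel_time(1000, 200, 300)  # add 300 clocks of branching
print(baseline, branchy)  # 1000 1000 -> the branches cost nothing here
heavier  = kernel_time(1000, 200, 900)
print(heavier)            # 1100 -> only the excess past memory time shows
```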

http://realworldtech.com/page.cfm?ArticleID=RWT090808195242

You will probably like that one.