I’m trying to analyze the architecture of the GTX 280 family and understand how it runs its computations. In the process some specific questions have arisen, and I hope the CUDA community can help me out.
BTW, I did read the CUDA Programming Guide and the GTX 200 Technical Brief, so no need to send me there. ;)
Since each SM contains 8 SP cores and is supposed to run 32 threads (a warp) at a time, I assume each SP must be pipelined with 4 threads at consecutive execution stages, correct? Hence, on every clock cycle a result for one thread out of four is produced, giving an effective computation rate of 1/4 of the clock rate from a single thread’s perspective - say, 300 MHz for a 1.2 GHz clock. (I intentionally omit memory read latencies here.)
I’ve read somewhere that if an SP core encounters a memory read latency, it is capable of switching to another thread - possibly abandoning that stalled thread for many (>4?) cycles, since memory read latencies can be quite large. Does that mean that SP cores are capable of out-of-order execution, or is that a stretch? If the stall indeed takes more than 4 cycles, are the ‘wasted’ cycles (one out of four) assigned to another thread, or are they filled with no-ops?
Is there any information on the pipeline contents and length of an SM - or of some of its parts, like the IU or SP?
For the purpose of branch-heavy GPGPU, is it fair to say that a GTX 280 is in fact a 30-core processor (as opposed to 240), since that’s how many perfectly independent execution paths it is able to process in parallel?
How about the resources listed in the FAQ? They have some good information, too.
* J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
<a target='_blank' rel='noopener noreferrer' href='http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532'>http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532</a>
* E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, Mar./Apr. 2008, pp. 39-55.
It’s very fuzzily defined. Nobody really knows everything except for people at NVIDIA and they generally keep their mouths shut.
I can’t really give you anything more than a few additional tidbits mentioned by NVIDIA people on the forums, if you dig hard enough:
(1) The warp size of 32 is set somewhere in software/driver/hardware microcode. The hardware is actually capable of a warp size of 16, but 32 was chosen so that if future hardware did move to a native width of 32, developers wouldn’t see a change.
One consequence of (1) is that the clock for the SPs only needs to be twice the clock of the instruction decoders.
Not only can you get interleaved execution among different threads on an SP; a single thread can continue executing past a memory request until the register that the memory request writes to is actually used. Only then will that thread go into a wait state until the memory request completes. Whether or not you define this as out-of-order is a matter of semantics.
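A hypothetical kernel makes the behavior described above concrete (the kernel and its names are mine, just for illustration):

```cuda
// Sketch: the load into 'v' is issued, but the thread keeps executing the
// independent arithmetic below it while the load is in flight. The thread
// only stalls at the first instruction that actually *uses* v.
__global__ void latency_hiding(const float *in, float *out, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = in[i];        // global load issued; no stall here

    float acc = a * b;      // independent work, overlaps with the load
    acc = acc * acc + a;    // still independent of v

    out[i] = acc + v;       // first use of v: the thread waits here if the
                            // load has not completed yet
}
```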
The only other “out-of-order” type behavior I remember being mentioned is the ability to run the special function unit along with the normal unit to get a MADD+MUL at the same time.
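As a sketch of that pairing (whether the compiler/hardware actually dual-issues any given pair is not something we can control from CUDA C):

```cuda
// Hypothetical example: the multiply-add can execute on the SP units while
// the extra multiply goes to the special function unit, so both operations
// may retire in the same issue slot.
__global__ void mad_plus_mul(float *out, float a, float b, float c, float d)
{
    float mad = a * b + c;   // MAD on the SP
    float mul = a * d;       // MUL that may dual-issue on the SFU
    out[threadIdx.x] = mad + mul;
}
```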
I would say it is a matter of opinion. My opinion would be no, for two reasons:

1) Even branch-heavy code can be written to take advantage of parallel memory loads. With memory read coalescing, 240 SPs reading data is A LOT faster than 30 would be.

2) The vast majority of the CUDA apps out there are limited by exactly the memory bandwidth and latencies you want to ignore. For a memory-bandwidth-limited app, one can add a massive amount of branches “for free,” as they can all be processed long before the next memory reads come in.
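Point 1) can be sketched as follows (a made-up kernel, not from any particular app): the branch diverges per thread, but because consecutive threads still read consecutive addresses, each half-warp’s load coalesces into a single wide memory transaction.

```cuda
// Branch-heavy but coalesced: thread k reads in[base + k], so the loads
// coalesce regardless of which way each thread's branch goes. The branch
// cost is small next to the memory latency it overlaps with.
__global__ void branchy_but_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];          // coalesced load across the half-warp

    if (v > 0.0f)             // divergent branch, serialized per warp
        out[i] = v * 2.0f;
    else
        out[i] = -v;
}
```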