Nvidia GF104 vs GF100

Hi everyone,

I’m doing some research on whether the ILP added in GF104 brings any benefit over the GF100 architecture. I will write my own simple applications, but I was wondering: are there any benchmark results for the GTX 465, GTX 470 or GTX 480 on the examples that come with the CUDA SDK, so that I can compare them to my GTX 460 results?

Thanks in advance.

What metric will you use to make this comparison? FLOPS/chip size? FLOPS/Watt? FLOPS/core? FLOPS/USD?

You probably have seen them already, but Vasily Volkov’s GTC presentation might help you maximize ILP in whatever benchmarks you choose: http://www.cs.berkeley.edu/~volkov/

Thanks, I will check out Volkov’s GTC presentation.

As for the metrics, I won’t really compare the two architectures. The primary goal is to analyze performance on the GTX 460, but I wanted to have GF100 results in order to draw better conclusions.

Can someone please explain why ILP doesn’t scale well past 2, and why even the ILP=2 results aren’t at the theoretical maximum?

There is no dependency between the instructions, and an available warp is issued with 0 latency…

If I understood correctly, the latency we need to hide is 18 cycles.

ILP=2. Let’s assume we have 288 threads, that’s 9 warps.

In 18 cycles we can issue 2 operations for all threads within the 9 warps.

In the 19th cycle, the operands for one operation of the next iteration within the first warp are ready, and we can issue that operation.

So why do we need 320 threads (10 warps) to reach 100% utilization? Am I missing something?
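
To make the setup concrete, here is a minimal sketch of the kind of ILP=2 kernel being discussed; the kernel name, constants and launch configuration are only illustrative assumptions. Each thread carries two independent FMA chains, so consecutive instructions have no register dependency:

// Sketch of an ILP=2 arithmetic microbenchmark (names and constants are placeholders).
__global__ void ilp2_kernel(float *out, float b, float c, int iters)
{
    float a = threadIdx.x;   // accumulator for chain 1
    float d = blockIdx.x;    // accumulator for chain 2
    for (int i = 0; i < iters; ++i) {
        a = a*b + c;         // depends only on the previous a
        d = d*b + c;         // depends only on the previous d, independent of a
    }
    out[blockIdx.x*blockDim.x + threadIdx.x] = a + d;   // keep the results live
}

Launching this with 288 versus 320 threads per block and comparing the achieved FLOPS against the peak should expose the effect in question.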

It doesn’t seem to work as fast if you have an odd number of warps. Check the wiggling in the plot at http://forums.nvidia.com/index.php?showtopic=176927, which was done at a finer sampling than the plots in the presentation slides. I don’t know what the problem is, but I’d guess that it might have something to do with having two warp schedulers per SM.

I think ILP doesn’t scale very far because the space in the scoreboard is limited. Check the NVIDIA patent on scoreboarding for more details: http://www.google.com/patents/about?id=vDiuAAAAEBAJ.

Vasily

I don’t think registers are the problem.

I found more on that issue here:

http://forum.beyond3d.com/showthread.php?t=58077

I think I realised what the problem is, at least for ILP=2. Arithmetic latency is 18.

a=b*a+e;

b=b*d+e;

Let’s assume we have 288 threads, that’s 9 warps.

Cycles 1 & 2 - Issue warp1 and warp2 1st instruction. So we are able to issue warp1’s 1st instruction again (in the next iteration) at cycle 20, because the latency is 18 and the instruction has been issued on all threads of the warp by the end of cycle 2.

Cycles 3 & 4 - Issue warp3 and warp4 1st instruction

Cycles 5 & 6 - Issue warp5 and warp6 1st instruction

Cycles 7 & 8 - Issue warp7 and warp8 1st instruction

Cycles 9 & 10 - Issue warp9 1st instruction and warp1 2nd instruction

Cycles 11 & 12 - Issue warp2 and warp3 2nd instruction

Cycles 13 & 14 - Issue warp4 and warp5 2nd instruction

Cycles 15 & 16 - Issue warp6 and warp7 2nd instruction

Cycles 17 & 18 - Issue warp8 and warp9 2nd instruction

Cycles 19 & 20 - There is no warp available for execution, because warp1’s 1st instruction will only be ready after cycle 20…

Let’s assume we have 320 threads, that’s 10 warps.

Cycles 1 & 2 - Issue warp1 and warp2 1st instruction. So we are able to issue warp1’s 1st instruction again (in the next iteration) at cycle 20, because the latency is 18 and the instruction has been issued on all threads of the warp by the end of cycle 2.

Cycles 3 & 4 - Issue warp3 and warp4 1st instruction

Cycles 5 & 6 - Issue warp5 and warp6 1st instruction

Cycles 7 & 8 - Issue warp7 and warp8 1st instruction

Cycles 9 & 10 - Issue warp9 and warp10 1st instruction

Cycles 11 & 12 - Issue warp1 and warp2 2nd instruction

Cycles 13 & 14 - Issue warp3 and warp4 2nd instruction

Cycles 15 & 16 - Issue warp5 and warp6 2nd instruction

Cycles 17 & 18 - Issue warp7 and warp8 2nd instruction

Cycles 19 & 20 - Issue warp9 and warp10 2nd instruction

Cycles 21 & 22 - Issue warp1 and warp2 1st instruction.

So, with 320 threads we don’t have idle cycles and get the full peak.

Of course, I’m not sure whether things really work like this, so please correct me if I’m wrong.

In this case, the CUDA cores have two different instructions in their pipeline stages at the same time; I think you mentioned that in your presentation.
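
One way to check this schedule empirically is to time such a kernel with CUDA events and convert the result to FLOPS. A rough host-side sketch follows; the kernel name, block count and FLOP accounting are assumptions, and d_out is assumed to be a device buffer allocated earlier with cudaMalloc:

// Rough timing sketch; place inside host code after allocating d_out.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int blocks  = 14;        // e.g. one block per SM on a GTX 470; adjust for your GPU
int threads = 288;       // 9 warps per block, the case analyzed above
int iters   = 1 << 20;

cudaEventRecord(start);
ilp2_kernel<<<blocks, threads>>>(d_out, 1.000001f, 0.000001f, iters);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// 2 FMAs per iteration per thread, 2 flops per FMA
double flops = 2.0 * 2.0 * (double)iters * blocks * threads;
printf("achieved: %.1f GFLOPS\n", flops / (ms * 1e6));

Running the same code with 320 threads per block should then show whether the extra warp removes the idle cycles described above.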

Do you really mean that? The first instruction depends on the second instruction from the previous iteration, so you can’t issue the first instruction again at cycle 20. I usually do something like this:

a=a*b+c;

d=d*b+c;

Anyway, if you consider your scheduling scheme, why CAN you hide the latency using 576 threads and 1 instruction per warp in the pipeline, but CAN’T when using 576/2=288 threads and 2 instructions per warp in the pipeline?

Vasily
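
As a sanity check on these numbers (assuming, as above, a latency of about 18 cycles and a throughput of 32 operations per cycle for one GF100 SM), the 576 figure follows from a simple latency-times-throughput balance:

\[
\text{operations in flight} = \text{latency} \times \text{throughput} = 18 \times 32 = 576,
\]

so in principle either 576 threads with ILP=1 or 576/2 = 288 threads with ILP=2 should keep the SM busy, which is exactly the puzzle being discussed.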

Sorry, my bad.

The instructions should be as you wrote:

a=a*b+c;

d=d*b+c;

Yep, you’re right. Then it isn’t possible for 576 threads and 1 instruction per warp in the pipeline either. Now I totally don’t understand how things actually work. Is it possible that the latency is lower than 18?

And I didn’t understand the part of your presentation that explains ILP with regard to occupancy.

Using ILP and fewer threads doesn’t use fewer CUDA cores; it just shows that TLP isn’t the only way to reach the peak.

Edit: if by occupancy you mean active warps / maximum warps, then it’s OK.
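
One way to test whether the arithmetic latency really is around 18 cycles is a clock()-based microbenchmark over a single dependent chain. The following is only a sketch (the kernel name and constants are arbitrary, and clock() overhead is not calibrated out):

// Sketch: estimate dependent-FMA latency with clock(); launch with a single warp, e.g. <<<1, 32>>>.
__global__ void fma_latency(float *out, unsigned int *cycles, float b, float c)
{
    float a = threadIdx.x;
    unsigned int t0 = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        a = a*b + c;                   // each FMA waits for the previous one
    unsigned int t1 = (unsigned int)clock();
    out[threadIdx.x] = a;              // keep the chain live
    if (threadIdx.x == 0)
        *cycles = (t1 - t0) / 256;     // approximate cycles per dependent FMA
}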

I think it depends on how you count latency. In my interpretation, latency=18 means that you can issue dependent instructions on cycles 0, 18, 36, 54, … So it is the time between when you start executing a new operation and when you can start executing the dependent operation. As I see it, you don’t count the first 2 cycles, so in your interpretation the latency is 2 cycles shorter.

Yeah, I refer to occupancy as defined in the programming manual. In fact, I am surprised by how many people are confused by this. I have never seen this term used with any other meaning.

Vasily

Sorry for slightly hijacking the thread, but my post is also related to GF100 and GF104.

If I understand correctly, GF100 has 32 cores per SM, and GF104 has 48 cores per SM.
According to the Fermi whitepaper, GF100 uses a Dual Warp Scheduler, so it runs one half-warp on 16 cores and another half-warp (from a different warp) on the other 16 cores.
How does it behave on GF104, which has 48 cores? Does it run 3 half-warps at the same time, or does it run one warp and one half-warp?

Is there a more recent whitepaper describing the newer Fermi architectures, i.e. GF104, GF106 and GF108? I have only found some information on Wikipedia.

It issues 1 or 2 instructions for one half-warp and 1 or 2 instructions for the other half-warp.

In order to utilize all 48 cores, GF104 needs at least 2 instructions issued from one of the half-warps.
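
A rough way to see the arithmetic behind that, treating each issued instruction as occupying one 16-core group (as in the half-warp description above):

\[
(1 + 1) \times 16 = 32 < 48, \qquad (2 + 1) \times 16 = 48,
\]

so single issue from both schedulers leaves one 16-core group idle, and it is the dual-issued, independent instruction that fills it.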