Nvidia GF104 vs GF100

Hi everyone,

I’m doing some research on whether the ILP added in GF104 brings any benefit over the GF100 architecture. I will write my own simple applications, but I was wondering: are there any benchmark results for the GTX 465, GTX 470 or GTX 480 on the examples that come with the CUDA SDK, so that I can compare them to my GTX 460 results?

Thanks in advance.

What metric will you use to make this comparison? FLOPS/chip size? FLOPS/Watt? FLOPS/core? FLOPS/USD?

You probably have seen them already, but Vasily Volkov’s GTC presentation might help you maximize ILP in whatever benchmarks you choose: http://www.cs.berkeley.edu/~volkov/

Thanks, I will check out Volkov’s GTC presentation.

As for the metrics, I won’t really compare the two architectures. The primary goal is to analyze performance on the GTX 460, but I wanted to have GF100 results in order to draw better conclusions.

Can someone please explain why ILP doesn’t scale well past 2, and why even the ILP=2 results aren’t at the theoretical maximum?

There is no dependency between the instructions, and an available warp is issued with 0 latency…

If I understood correctly, the latency we need to hide is 18 cycles.

ILP=2. Let’s assume we have 288 threads, that’s 9 warps.

In 18 cycles we can issue 2 operations for all threads within the 9 warps.

In the 19th cycle, the operands for one operation of the next iteration within the first warp are ready, and we can issue that operation.

So why do we need 320 threads (10 warps) to reach 100% utilization? Am I missing something?
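
To make the setup concrete, here is a minimal sketch of the kind of ILP=2 kernel being discussed; the kernel name, constants and launch configuration are only illustrative assumptions. Each thread carries two independent FMA chains, so consecutive instructions have no register dependency:

// Sketch of an ILP=2 arithmetic microbenchmark (names and constants are placeholders).
__global__ void ilp2_kernel(float *out, float b, float c, int iters)
{
    float a = threadIdx.x;   // accumulator for chain 1
    float d = blockIdx.x;    // accumulator for chain 2
    for (int i = 0; i < iters; ++i) {
        a = a*b + c;         // depends only on the previous a
        d = d*b + c;         // depends only on the previous d, independent of a
    }
    out[blockIdx.x*blockDim.x + threadIdx.x] = a + d;   // keep the results live
}

Launching this with 288 versus 320 threads per block and comparing the achieved FLOPS against the peak should expose the effect in question.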

It doesn’t seem to work as fast if you have an odd number of warps. Check the wiggling in the plot at http://forums.nvidia.com/index.php?showtopic=176927, which was done at a finer sampling than the plots in the presentation slides. I don’t know what the problem is, but I’d guess that it might have something to do with having two warp schedulers per SM.

I think ILP doesn’t scale very far because the space in the scoreboard is limited. Check the NVIDIA patent on scoreboarding for more details: http://www.google.com/patents/about?id=vDiuAAAAEBAJ.

Vasily

I don’t think registers are the problem.

I found more on that issue here:

http://forum.beyond3d.com/showthread.php?t=58077

I think I realised what the problem is, at least for ILP=2. Arithmetic latency is 18.

a=b*a+e;

b=b*d+e;

Let’s assume we have 288 threads, that’s 9 warps.

Cycles 1 & 2 - Issue warp1 and warp2 1st instruction. So we are able to issue warp1’s 1st instruction again (in the next iteration) at cycle 20, because the latency is 18 and the instruction has been issued on all threads of the warp by the end of cycle 2.

Cycles 3 & 4 - Issue warp3 and warp4 1st instruction

Cycles 5 & 6 - Issue warp5 and warp6 1st instruction

Cycles 7 & 8 - Issue warp7 and warp8 1st instruction

Cycles 9 & 10 - Issue warp9 1st instruction and warp1 2nd instruction

Cycles 11 & 12 - Issue warp2 and warp3 2nd instruction

Cycles 13 & 14 - Issue warp4 and warp5 2nd instruction

Cycles 15 & 16 - Issue warp6 and warp7 2nd instruction

Cycles 17 & 18 - Issue warp8 and warp9 2nd instruction

Cycles 19 & 20 - There is no warp available for execution, because warp1’s 1st instruction will only be ready after cycle 20…

Let’s assume we have 320 threads, that’s 10 warps.

Cycles 1 & 2 - Issue warp1 and warp2 1st instruction. So we are able to issue warp1’s 1st instruction again (in the next iteration) at cycle 20, because the latency is 18 and the instruction has been issued on all threads of the warp by the end of cycle 2.

Cycles 3 & 4 - Issue warp3 and warp4 1st instruction

Cycles 5 & 6 - Issue warp5 and warp6 1st instruction

Cycles 7 & 8 - Issue warp7 and warp8 1st instruction

Cycles 9 & 10 - Issue warp9 and warp10 1st instruction

Cycles 11 & 12 - Issue warp1 and warp2 2nd instruction

Cycles 13 & 14 - Issue warp3 and warp4 2nd instruction

Cycles 15 & 16 - Issue warp5 and warp6 2nd instruction

Cycles 17 & 18 - Issue warp7 and warp8 2nd instruction

Cycles 19 & 20 - Issue warp9 and warp10 2nd instruction

Cycles 21 & 22 - Issue warp1 and warp2 1st instruction.

So, with 320 threads we don’t have idle cycles and get the full peak.

Of course, I’m not sure whether things really work like this, so please correct me if I’m wrong.

In this case, the CUDA cores have two different instructions in their pipeline stages at the same time; I think you mentioned that in your presentation.
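
One way to check this schedule empirically is to time such a kernel with CUDA events and convert the result to FLOPS. A rough host-side sketch follows; the kernel name, block count and FLOP accounting are assumptions, and d_out is assumed to be a device buffer allocated earlier with cudaMalloc:

// Rough timing sketch; place inside host code after allocating d_out.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int blocks  = 14;        // e.g. one block per SM on a GTX 470; adjust for your GPU
int threads = 288;       // 9 warps per block, the case analyzed above
int iters   = 1 << 20;

cudaEventRecord(start);
ilp2_kernel<<<blocks, threads>>>(d_out, 1.000001f, 0.000001f, iters);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// 2 FMAs per iteration per thread, 2 flops per FMA
double flops = 2.0 * 2.0 * (double)iters * blocks * threads;
printf("achieved: %.1f GFLOPS\n", flops / (ms * 1e6));

Running the same code with 320 threads per block should then show whether the extra warp removes the idle cycles described above.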

Do you really mean that? The first instruction depends on the second instruction from the previous iteration, so you can’t issue the first instruction again at cycle 20. I usually do something like this:

a=a*b+c;

d=d*b+c;

Anyway, if you consider your scheduling scheme, why CAN you hide the latency using 576 threads and 1 instruction per warp in the pipeline, but CAN’T when using 576/2=288 threads and 2 instructions per warp in the pipeline?

Vasily
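
As a sanity check on these numbers (assuming, as above, a latency of about 18 cycles and a throughput of 32 operations per cycle for one GF100 SM), the 576 figure follows from a simple latency-times-throughput balance:

\[
\text{operations in flight} = \text{latency} \times \text{throughput} = 18 \times 32 = 576,
\]

so in principle either 576 threads with ILP=1 or 576/2 = 288 threads with ILP=2 should keep the SM busy, which is exactly the puzzle being discussed.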

Sorry, my bad.

The instructions should be as you wrote:

a=a*b+c;

d=d*b+c;

Yep, you’re right. Then it isn’t possible for 576 threads and 1 instruction per warp in the pipeline either. Now I totally don’t understand how things actually work. Is it possible that the latency is lower than 18?

And I didn’t understand the part of your presentation that explains ILP with regard to occupancy.

Using ILP and fewer threads doesn’t use fewer CUDA cores; it just shows that TLP isn’t the only way to reach the peak.

Edit: if by occupancy you mean active warps / maximum warps, then it’s OK.
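
One way to test whether the arithmetic latency really is around 18 cycles is a clock()-based microbenchmark over a single dependent chain. The following is only a sketch (the kernel name and constants are arbitrary, and clock() overhead is not calibrated out):

// Sketch: estimate dependent-FMA latency with clock(); launch with a single warp, e.g. <<<1, 32>>>.
__global__ void fma_latency(float *out, unsigned int *cycles, float b, float c)
{
    float a = threadIdx.x;
    unsigned int t0 = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        a = a*b + c;                   // each FMA waits for the previous one
    unsigned int t1 = (unsigned int)clock();
    out[threadIdx.x] = a;              // keep the chain live
    if (threadIdx.x == 0)
        *cycles = (t1 - t0) / 256;     // approximate cycles per dependent FMA
}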

I think it depends on how you count latency. In my interpretation, latency=18 means that you can issue dependent instructions on cycles 0, 18, 36, 54, … So it is the time between when you start executing a new operation and when you can start executing the dependent operation. As I see it, you don’t count the first 2 cycles, so in your interpretation the latency is 2 cycles shorter.

Yeah, I refer to occupancy as defined in the programming manual. In fact, I am surprised by how many people are confused by this. I have never seen this term used with any other meaning.

Vasily

Sorry for slightly hijacking the thread, but my post is also related to GF100 and GF104.

If I understand correctly, GF100 has 32 cores per SM, and GF104 has 48 cores per SM.
According to the Fermi whitepaper, GF100 uses a Dual Warp Scheduler, so it runs one half-warp on 16 cores and another half-warp (from a different warp) on the other 16 cores.
How does it behave on GF104, which has 48 cores? Does it run 3 half-warps at the same time, or does it run one warp and one half-warp?

Is there a more recent whitepaper describing the newer Fermi architectures, i.e. GF104, GF106 and GF108? I have only found some information on Wikipedia.

It issues 1 or 2 instructions for one half-warp and 1 or 2 instructions for the other half-warp.

In order to utilize all 48 cores, GF104 needs at least 2 instructions issued from one of the half-warps.
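
A rough way to see the arithmetic behind that, treating each issued instruction as occupying one 16-core group (as in the half-warp description above):

\[
(1 + 1) \times 16 = 32 < 48, \qquad (2 + 1) \times 16 = 48,
\]

so single issue from both schedulers leaves one 16-core group idle, and it is the dual-issued, independent instruction that fills it.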