I’m doing some research into whether the ILP in GF104 brings any benefit over the GF100 architecture. I will write my own simple applications, but I was wondering whether there are any benchmark results for the GTX 465, GTX 470 or GTX 480 on the examples that come with the CUDA SDK, so that I can compare them with my GTX 460 results?
As for the metrics, I won’t really compare the two architectures. The primary goal is to analyze performance on the GTX 460; I just wanted results from GF100 in order to draw better conclusions.
It doesn’t seem to work as fast if you have an odd number of warps. Check the wiggling in the plot at http://forums.nvidia.com/index.php?showtopic=176927, which was done at a finer sampling than the plots in the presentation slides. I don’t know what the problem is, but I’d guess it might have something to do with having two warp schedulers per SM.
I think I realised what the problem is, at least for ILP=2. The arithmetic latency is 18.
a = b*a + e;
b = b*d + e;
Let’s assume we have 288 threads, that is, 9 warps.
Cycles 1 & 2 - issue the 1st instruction of warp 1 and warp 2. So we can issue the 1st instruction again in the next iteration at cycle 20, because the latency is 18 and the instruction has been issued for all threads of the warp by the end of cycle 2.
Cycles 19 & 20 - we have no warp available for execution, because warp 1’s 1st instruction will only be ready after cycle 20…
Let’s assume we have 320 threads, that is, 10 warps.
Cycles 1 & 2 - issue the 1st instruction of warp 1 and warp 2; as before, we can issue warp 1’s 1st instruction again at cycle 20, because the latency is 18 and the instruction has been issued for all threads of the warp by the end of cycle 2.
Cycles 19 & 20 - warps 9 and 10 still have instructions left to issue, so the schedulers are never idle before warp 1 becomes ready again.
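For reference, the timed part of my kernel looks roughly like this (just a sketch; the kernel name, unroll count and the final store are placeholders, the store is only there so the compiler doesn’t throw the arithmetic away):

__global__ void ilp2_test(float *out, float b, float d, float e)
{
    float a = threadIdx.x;                                // per-thread starting value
    #pragma unroll
    for (int i = 0; i < 256; i++) {
        a = b*a + e;                                      // 1st instruction of the iteration
        b = b*d + e;                                      // 2nd instruction, independent of the 1st within the same iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;   // keep both results live
}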
Do you really mean that? The first instruction depends on the second instruction from the previous iteration, so you can’t issue the first instruction again at cycle 20. I usually do something like this:
a = a*b + c;
d = d*b + c;
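Written out as a complete kernel, it would be roughly the following (only a sketch; the kernel name, unroll count and final store are arbitrary, the point is that the two FMA chains never read each other’s results):

__global__ void ilp2_independent(float *out, float b, float c)
{
    float a = threadIdx.x;                                // accumulator of chain 1
    float d = threadIdx.x + 1.0f;                         // accumulator of chain 2
    #pragma unroll
    for (int i = 0; i < 256; i++) {
        a = a*b + c;                                      // chain 1: depends only on the previous a
        d = d*b + c;                                      // chain 2: depends only on the previous d
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + d;   // keep both chains live
}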
Anyway, if you consider your scheduling scheme, why CAN you hide the latency using 576 threads with 1 instruction per warp in the pipeline, but CAN’T when using 576/2 = 288 threads with 2 instructions per warp in the pipeline?
Yep, you’re right. Then it isn’t possible for 576 threads with 1 instruction per warp in the pipeline either. Now I totally don’t understand how things actually work. Is it possible that the latency is lower than 18?
Also, I didn’t understand the part of your presentation that explains ILP in relation to occupancy.
Using ILP with fewer threads doesn’t use fewer CUDA cores; it just shows that TLP isn’t the only way to reach the peak.
edit: If by occupancy you mean active warps / maximum warps, then it’s OK.
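As a concrete example (assuming the Fermi limit of 48 resident warps per SM): 288 threads per SM is 9 warps, i.e. occupancy 9/48 ≈ 19%, while 576 threads is 18 warps, i.e. 18/48 = 37.5%, so the ILP=2 version would need only half the occupancy to keep the pipeline busy.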
I think it depends on how you count latency. In my interpretation, latency = 18 means that you can issue dependent instructions on cycles 0, 18, 36, 54, … . So it is the time between when you start executing a new operation and when you can start executing the dependent operation. As I see it, you don’t count the first 2 cycles, so in your interpretation the latency is 2 cycles shorter.
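To put some numbers on it (a back-of-the-envelope estimate only, assuming GF100’s 32 cores per SM and the issue-to-issue convention for latency):

independent FMAs in flight per SM ≈ latency × throughput = 18 cycles × 32 FMA/cycle = 576
576 = 576 threads × 1 independent instruction each (pure TLP)
576 = 288 threads × 2 independent instructions each (ILP = 2)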
Yeah, I refer to occupancy as defined in the programming manual. In fact, I am surprised by how many people are confused by this. I have never seen the term used with any other meaning.
Sorry for slightly hijacking the thread, but my post is also related to GF100 and GF104.
If I understand correctly, GF100 has 32 cores per SM, and GF104 has 48 cores per SM.
According to the Fermi whitepaper, GF100 uses a dual warp scheduler, so it runs one half-warp on 16 cores and another half-warp (from a different warp) on the other 16 cores.
How does this work on GF104, which has 48 cores? Does it run 3 half-warps at the same time, or one warp plus one half-warp?
Is there a more recent whitepaper describing the newer Fermi architectures, I mean GF104, GF106, GF108? I have only found some information on Wikipedia.