That is because binomialOptions is not implemented in the most efficient way. It performs redundant memory accesses and redundant computations as well… it's not well optimized. And it is not full of "MAD"s to bring out the performance difference you are looking for.
To see the 1/8th performance difference that you want to see, you need to profile a typical DGEMM routine (which is full of MADs) on the GTX 480 and the C2050 to bring out the perf difference. Make sure you use the CUBLAS that is optimized for the C2050. I hope NVIDIA has released it with the toolkit.
OR
You can write a "mad" kernel that performs only MADs, like what vvolkov and others do: "a = a*b + c".
Since FERMI can dual-issue, it would be a good idea to keep an "even" number of active warps. I think this was discussed by vvolkov and others (though I did not follow it completely… you may want to re-check what they were talking about).
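In case it helps, here is a minimal sketch of the kind of FMA-only microbenchmark described above ("a = a*b + c" in a dependent chain). The kernel name, launch configuration, and iteration count are all illustrative assumptions, not tuned values; this is just a starting point, not a definitive benchmark.

```cuda
// Sketch of an FMA throughput microbenchmark (assumed names/parameters).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_burn(double *out, double b, double c, int iters)
{
    // Each thread works on its own register value; the chain a = a*b + c
    // is data-dependent, so the compiler cannot collapse the loop.
    double a = threadIdx.x * 1e-9;
    for (int i = 0; i < iters; ++i) {
        a = a * b + c;   // one DP FMA = 2 flops
        a = a * b + c;
        a = a * b + c;
        a = a * b + c;
    }
    // Write the result so the loop is not dead-code eliminated.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

int main()
{
    const int blocks = 120, threads = 256, iters = 100000;
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_burn<<<blocks, threads>>>(d_out, 1.000001, 1e-9, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * 4.0 * (double)iters * blocks * threads;
    printf("%.1f DP GFLOPS\n", flops / (ms * 1e-3) / 1e9);
    cudaFree(d_out);
    return 0;
}
```

Note that a single dependent chain per thread may not saturate the DP units; you would want enough active warps (an even number, per the dual-issue point above) to cover the FMA latency.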
I have now changed the simpleCUBLAS example from Sgemm to Dgemm, and with N = 1024*7, which is about the largest N that fits onto the GTX 480 card, I get these execution times:
Tesla S2050 (1 GPU): 4394 ms
GTX 480: 4880 ms

for N = 4096:
Tesla S2050 (1 GPU): 858 ms
GTX 480: 941 ms
So, finally, Tesla is slightly ahead of the GTX 480 ;-)
On the other hand, it might be that the difference stems from a faster system bus, as the Tesla is connected to a Xeon system and the GTX 480 to an i7 PC.
Anyway, my conclusions for now are:
a) it must be extremely difficult to exploit Tesla's improved DP units
b) I easily get the same (if not higher) performance from the GTX 480, for about 1/10 of the price
c) if large GPU memory is not an issue, then Tesla is not worth the premium at all
I am rather surprised at your results… though I have not used FERMI myself. Which version of CUBLAS are you using? I am assuming that it is the one that is FERMI-aware…
I have CentOS 5.3, devdriver_3.1_linux_64_256.40, cudatoolkit_3.1_linux_64_rhel5.4, and gpucomputingsdk_3.1_linux installed. At least I believe that CUBLAS is Fermi-aware, but MAYBE it is not a good idea to install the toolkit for RHEL 5.4 on a CentOS 5.3 system?
The problem is that I have to stick to centos 5.3 for IB driver reasons.
For me the only explanation is that, for some reason, only one in four DP units is 'visible' to my kernels.
I’ll take your advice and write some simple kernel with heavy local FMA instructions and see what happens…
I think most of your results are explained by the approximately 20% higher memory bandwidth, as others have already mentioned. The DGEMM example running on CUBLAS 3.1 also seems to be quite compute bound. Some NVIDIA benchmarks presented in June only reached around 170 DP GFLOPS out of a theoretical maximum of about 515. Thus it seems this example doesn't exploit the extra DP units, as you concluded in (a).
So it seems the extra memory shuffling caused by DP (2× the data) makes it bound. They should be able to overcome this, since there are now SGEMM codes running at over 1 TFLOP on the GTX 480 (MAGMA). I haven't done the numbers, but wouldn't that imply that they are already doing huge amounts of memory shuffling? Any thoughts?
At least, from what I remember, NVIDIA had said that DP performance reaches 1/2 of SP performance… that was a long time back, even before they officially released FERMI.
Btw, can you tell me more about "memory shuffling"?
Pezet,
Thanks for considering my advice. But please be aware that I am not an expert in Fermi… Wish you good luck!
I remember that FERMI DGEMM was running at 30% below peak performance some time back. It was fixed later. Check this URL…
3.2 RC2 was just released yesterday. The release notes highlight that CUBLAS performance has increased 50% to 300% for all data types and APIs. You may want to check this out.