Tesla S2050 double precision performance too low

That is because binomialOptions is not implemented in the most efficient way. It performs redundant memory accesses and redundant computations, and it is not particularly well optimized. More importantly, it is not full of "MAD"s, so it will not bring out the performance difference you are looking for.

To see the 1/8 performance difference you are looking for, you need to profile a typical DGEMM routine (which is full of MADs) on both the GTX 480 and the C2050. Make sure you use the CUBLAS build that is optimized for the C2050; I hope NVIDIA has released it with the toolkit.

OR

You can write a "mad" kernel that performs only MADs, like vvolkov and others do: "a = a*b + c".
Since FERMI has dual-issue, it may also be a good idea to keep an "even" number of active warps. I think this was discussed by vvolkov and others (though I did not follow it completely; you may want to re-check what they were discussing).
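
Something like this is what I mean, a minimal sketch of such a microbenchmark (the kernel name, launch configuration and iteration count are just illustrative, nothing standard):

/* DP FMA throughput microbenchmark (a sketch, not a tuned benchmark).
   Each thread runs a long dependent chain of a = a*b + c, which the compiler
   turns into double-precision FMAs on FERMI. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dfma_kernel(double *out, int iters)
{
    double a = 1.0 + threadIdx.x * 1e-12;  /* per-thread start value            */
    const double b = 1.0000001;
    const double c = 1e-9;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;                     /* one DP FMA per iteration          */
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  /* keep the result live    */
}

int main(void)
{
    const int blocks = 120, threads = 256, iters = 1000000;

    double *d_out;
    cudaMalloc((void **)&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dfma_kernel<<<blocks, threads>>>(d_out, iters);        /* warm-up launch    */
    cudaEventRecord(start, 0);
    dfma_kernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    /* 2 flops (multiply + add) per FMA, per thread, per iteration */
    double gflops = 2.0 * blocks * threads * (double)iters / (ms * 1e6);
    printf("DP FMA throughput: %.1f GFLOP/s\n", gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

With enough warps resident per SM, this should be limited almost entirely by DP FMA throughput, so the C2050 and GTX 480 numbers should separate clearly.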

Hi Sarnath,

I have now changed the simpleCUBLAS example from Sgemm to Dgemm. With N = 1024*7 (= 7168), which is about the largest N that fits on the GTX 480 card, I get these execution times:

Tesla S2050 (1 GPU): 4394 ms

GTX 480: 4880 ms

For N = 4096:

Tesla S2050: 858 ms

GTX 480: 941 ms

So, finally, Tesla is slightly ahead of the GTX 480 ;-)
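
In case it is useful, this is roughly the kind of Dgemm timing I am doing; just a sketch against the legacy CUBLAS C API with CUDA event timing, not the exact simpleCUBLAS code (matrix contents and error checking are omitted or illustrative):

/* Sketch of the Dgemm timing (not the exact simpleCUBLAS code). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const int N = 1024 * 7;                 /* about the largest size that fits on the GTX 480 */
    const size_t elems = (size_t)N * N;

    double *h_A = (double *)malloc(elems * sizeof(double));
    for (size_t i = 0; i < elems; ++i) h_A[i] = 1.0 / (double)(i % 97 + 1);

    cublasInit();

    double *d_A, *d_B, *d_C;
    cublasAlloc(N * N, sizeof(double), (void **)&d_A);
    cublasAlloc(N * N, sizeof(double), (void **)&d_B);
    cublasAlloc(N * N, sizeof(double), (void **)&d_C);
    cublasSetVector(N * N, sizeof(double), h_A, 1, d_A, 1);
    cublasSetVector(N * N, sizeof(double), h_A, 1, d_B, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasDgemm('n', 'n', N, N, N, 1.0, d_A, N, d_B, N, 0.0, d_C, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("N = %d: %.0f ms, %.1f GFLOP/s\n",
           N, ms, 2.0 * N * N * N / (ms * 1e6));   /* DGEMM does 2*N^3 flops */

    cublasFree(d_A); cublasFree(d_B); cublasFree(d_C);
    cublasShutdown();
    free(h_A);
    return 0;
}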

On the other hand, it is possible that the difference stems from a faster system bus, as the Tesla is connected to a Xeon system while the GTX 480 is connected to an i7 PC.

Anyway, my conclusions for now are:

a) it must be extremely difficult to exploit Tesla's improved DP units

b) I easily get the same (if not higher) performance from the GTX 480, for about 1/10 of the price

c) if large GPU memory is not an issue, then the Tesla is not worth the premium at all

thank you for your help.

Pezet,

I am rather surprised at your results… though I have not used FERMI myself… Which version of cuBLAS are you using? I am assuming that it is the one that is FERMI aware…

Hello Sarnath,

I have CentOS 5.3 with devdriver_3.1_linux_64_256.40, cudatoolkit_3.1_linux_64_rhel5.4, and gpucomputingsdk_3.1_linux installed. I believe that CUBLAS is Fermi-aware, but maybe it is not a good idea to install the toolkit built for RHEL 5.4 on a CentOS 5.3 system?

The problem is that I have to stick to CentOS 5.3 for InfiniBand driver reasons.

For me, the only explanation is that for some reason only one out of every four DP units is 'visible' to my kernels.

I'll take your advice and write a simple kernel with heavy local FMA instructions and see what happens…

best wishes,

Peter

I think most of your results are explained by the GTX 480's roughly 20% higher memory bandwidth, as others have already mentioned. The DGEMM example running on CUBLAS 3.1 also seems to be quite compute bound: some NVIDIA benchmarks presented in June reached only around 170 DP GFLOPS out of a maximum of about 515. Thus it seems this example does not exploit the extra DP units, as you concluded in (a).

So it seems the extra memory shuffling caused by DP (2x the data) makes it compute bound. They should be able to overcome this, since there are now SGEMM codes running at over 1 TFLOP on the GTX 480 (MAGMA). I haven't done the numbers, but wouldn't that imply they are already doing huge amounts of memory shuffling? Any thoughts?
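
For what it is worth, here is a rough roofline-style sanity check. All numbers in it are my own assumptions (roughly 515 DP GFLOP/s peak and 144 GB/s bandwidth for the C2050, and 32x32 blocking in the GEMM):

/* Rough roofline check for DP GEMM on a C2050; every constant here is an assumption. */
#include <stdio.h>

int main(void)
{
    const double peak_dp_gflops = 515.0;  /* assumed C2050 DP peak          */
    const double bandwidth_gbs  = 144.0;  /* assumed C2050 memory bandwidth */
    const int    tile           = 32;     /* assumed blocking factor        */

    /* A blocked DGEMM reloads each element of A and B about N/tile times, so it
       does roughly tile/8 flops per byte of global traffic (doubles are 8 bytes). */
    const double flops_per_byte  = tile / 8.0;
    const double bandwidth_bound = bandwidth_gbs * flops_per_byte;

    printf("compute roofline:   %.0f GFLOP/s\n", peak_dp_gflops);
    printf("bandwidth roofline: %.0f GFLOP/s\n", bandwidth_bound);
    /* The reported ~170 GFLOP/s sits well below both bounds, which fits the
       conclusion that this DGEMM simply does not exploit the extra DP units. */
    return 0;
}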

Jim,

From what I remember, NVIDIA had said that DP performance reaches 1/2 of SP performance, a long time back, even before they officially released FERMI.

By the way, can you tell me more about "memory shuffling"?

Pezet,

Thanks for considering my advice, but please be aware that I am not an expert on FERMI… Wish you good luck!

I remember that FERMI DGEMM was running at 30% less than peak performance some time back. It was fixed later. Check this URL:

http://developer.download.nvidia.com/compu…Notes_Linux.txt

In particular, check the CUBLAS-related section…

Let me paste the relevant section for your convenience:

I recommend checking the release notes of the various toolkit versions, from the one you have up to 3.2 RC, to track the issue…

Thanks,

Best Regards,

Sarnath

Pezet,

3.2 RC2 was just released yesterday. The release-notes highlights say that CUBLAS performance has increased by 50% to 300% for all data types and APIs. You may want to check this out.

hello Sarnath,

thank you very much for the hints on 3.2 RC2.

I will install 3.2 RC2 in the next few days and I will keep you posted on the results :-)

peter

NVIDIA released the final version of 3.2 a few days ago. You might want to use that rather than 3.2 RC2.

Ehh, where exactly was 3.2 released? I cannot find it on the NVIDIA pages.
