That is because binomialOptions is not implemented in the most efficient way. It performs redundant memory accesses and redundant computations as well… it's not well optimized. And it is not full of "MAD"s to bring out the performance difference you are looking for.
To see the 1/8th performance difference that you want to see, you need to profile a typical DGEMM routine (which is full of MADs) on the GTX 480 and the C2050 to bring out the perf difference. Make sure you use the CUBLAS that is optimized for the C2050. I hope NVIDIA has released it with the toolkit.
OR
You can write a "mad" kernel that performs only MADs, like what vvolkov and others do: "a = a*b + c".
Since FERMI can dual-issue, it would be a good idea to keep an "even" number of active warps. I think this was discussed by vvolkov and others (though I did not follow it completely… you may want to re-check what they were talking about).
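In case it helps, here is a minimal sketch of the kind of FMA-only microbenchmark described above ("a = a*b + c" in a dependent chain). The kernel name, launch configuration, and iteration count are all illustrative assumptions, not tuned values; this is just a starting point, not a definitive benchmark.

```cuda
// Sketch of an FMA throughput microbenchmark (assumed names/parameters).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_burn(double *out, double b, double c, int iters)
{
    // Each thread works on its own register value; the chain a = a*b + c
    // is data-dependent, so the compiler cannot collapse the loop.
    double a = threadIdx.x * 1e-9;
    for (int i = 0; i < iters; ++i) {
        a = a * b + c;   // one DP FMA = 2 flops
        a = a * b + c;
        a = a * b + c;
        a = a * b + c;
    }
    // Write the result so the loop is not dead-code eliminated.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

int main()
{
    const int blocks = 120, threads = 256, iters = 100000;
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_burn<<<blocks, threads>>>(d_out, 1.000001, 1e-9, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * 4.0 * (double)iters * blocks * threads;
    printf("%.1f DP GFLOPS\n", flops / (ms * 1e-3) / 1e9);
    cudaFree(d_out);
    return 0;
}
```

Note that a single dependent chain per thread may not saturate the DP units; you would want enough active warps (an even number, per the dual-issue point above) to cover the FMA latency.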
I have now changed the simpleCUBLAS example from Sgemm to Dgemm, and with N = 1024*7, which is about the largest N that fits onto the GTX 480 card, I get these execution times:
Tesla S2050 (1 GPU): 4394 ms
GTX 480: 4880 ms

for N = 4096:
Tesla S2050 (1 GPU): 858 ms
GTX 480: 941 ms
So, finally, Tesla is slightly ahead of the GTX 480 ;-)
On the other hand, it might be that the difference stems from a faster system bus, as the Tesla is connected to a Xeon system and the GTX 480 to an i7 PC.
Anyway, my conclusions for now are:
a) it must be extremely difficult to exploit Tesla's improved DP units
b) I easily get the same (if not higher) performance from the GTX 480, for about 1/10 of the price
c) if large GPU memory is not an issue, then Tesla is not worth the premium at all
I am rather surprised at your results… though I have not used FERMI myself. Which version of CUBLAS are you using? I am assuming that it is the one that is FERMI-aware…
I have CentOS 5.3, devdriver_3.1_linux_64_256.40, cudatoolkit_3.1_linux_64_rhel5.4, and gpucomputingsdk_3.1_linux installed. At least I believe that CUBLAS is Fermi-aware, but MAYBE it is not a good idea to install the toolkit for RHEL 5.4 on a CentOS 5.3 system?
The problem is that I have to stick to centos 5.3 for IB driver reasons.
For me the only explanation is that, for some reason, only one in four DP units is 'visible' to my kernels.
I’ll take your advice and write some simple kernel with heavy local FMA instructions and see what happens…
I think most of your results are explained by the approximately 20% higher memory bandwidth, as others have already mentioned. The DGEMM example running on CUBLAS 3.1 also seems to be quite compute bound. Some NVIDIA benchmarks presented in June only reached around 170 DP GFLOPS out of a theoretical maximum of about 515. Thus it seems this example doesn't exploit the extra DP units, as you concluded in (a).
So it seems the extra memory shuffling caused by DP (2× the data) makes it bound. They should be able to overcome this, since there are now SGEMM codes running at over 1 TFLOP on the GTX 480 (MAGMA). I haven't done the numbers, but wouldn't that imply that they are already doing huge amounts of memory shuffling? Any thoughts?
At least, from what I remember, NVIDIA had said that DP performance reaches 1/2 of SP performance… that was a long time back, even before they officially released FERMI.
Btw, can you tell me more about "memory shuffling"?
Pezet,
Thanks for considering my advice. But please be aware that I am not an expert in Fermi… Wish you good luck!
I remember that FERMI DGEMM was running at 30% below peak performance some time back. It was fixed later. Check this URL…
3.2 RC2 was just released yesterday. The release notes highlight that CUBLAS performance has increased 50% to 300% for all data types and APIs. You may want to check this out.