Disappointing performance using C2050

athlonshi · September 1, 2010, 7:01pm

I expected the C2050 to be much faster than the GTX 260 for both single- and double-precision operations, but that does not seem to be the case.
With the C2050 (ECC off), my current kernel code is only 1.5 times faster than on the GTX 260. More surprisingly, when testing with the SDK matrixMul example, I only got ~100 GFLOPS on the C2050. After increasing the array size in the SDK example, this went up to ~180 GFLOPS, but that is still far below the theoretical single-precision peak of about 1030 GFLOPS. Very disappointing performance from the new card. I am thinking it may not be a bad idea to go for four GTX 4XX cards instead of one Tesla. :(
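
As a rough check on the numbers in this post, peak single-precision throughput on Fermi can be estimated as cores × shader clock × 2, counting one fused multiply-add per core per clock as two floating-point operations. The sketch below uses the 448 cores and 1.15 GHz quoted later in the thread; the function names are illustrative only.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_clock=2):
    """Theoretical peak in GFLOPS.

    The default of 2 flops per clock assumes one fused multiply-add
    per core per clock, which is how the Fermi peak is usually quoted.
    """
    return cores * clock_ghz * flops_per_core_per_clock


def fraction_of_peak(measured_gflops, peak):
    """Share of the theoretical peak that a measurement reaches."""
    return measured_gflops / peak


# Core count and clock as quoted in the thread for the C2050.
c2050_peak = peak_gflops(448, 1.15)

# The two matrixMul measurements reported in the post.
for measured in (100, 180):
    share = fraction_of_peak(measured, c2050_peak)
    print(f"{measured} GFLOPS is {share:.1%} of the estimated peak")
```

Even the larger matrix size reaches well under a fifth of this estimate, which is the gap the rest of the thread tries to explain.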

Uncle_Joe · September 1, 2010, 7:20pm

I am also using a C2050 and expected more speedup, but for now a speedup of about 1.6x is roughly in line with an estimate based on core count and clock:

1.15 GHz * 448 cores (C2050)
_____________________________ = 1.83
1.3 GHz * 216 cores (GTX 260)
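
The 1.83 figure is the C2050's cores × clock divided by the GTX 260's. A minimal sketch of the same arithmetic, using only the numbers from this post (the helper name is hypothetical, and the estimate ignores memory bandwidth and per-core architectural differences):

```python
def raw_throughput(cores, clock_ghz):
    """Cores times clock: a crude proxy for compute throughput."""
    return cores * clock_ghz


gtx260 = raw_throughput(216, 1.3)    # figures as quoted in the post
c2050 = raw_throughput(448, 1.15)

speedup = c2050 / gtx260
print(f"expected compute-bound speedup: {speedup:.2f}x")
```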

Several of the SDK examples run slower on Fermi than on GT200 due to architectural changes (matrix multiply is probably one of them). Just give NVIDIA some time to tune and improve the code. For example, a University of Virginia student and his professor recently engineered an incredibly fast GPU radix sort that handles about 1 billion 32-bit integers per second, 4x faster than the CUDA SDK radix sort.

Please be more patient. You probably don’t know how difficult parallel programming can be :)

athlonshi · September 1, 2010, 8:18pm

Understood, and I am not completely frustrated about it. :rolleyes:

I also did the math, but unlike compute capability 1.3, Fermi should have much better double-precision performance, since a double-precision instruction takes 2 clock cycles, compared to 32 on compute capability 1.3.

If the math is about right, I would definitely go for the GTX 480, which costs less than 500 bucks but has 480 cores at a 1.4 GHz clock. For the same money, I can buy 4 GTX 480s and do multi-GPU programming. Would the performance then be even better than that of a single Tesla C2050?

So my question is how to take full advantage of the Tesla card.

Uncle_Joe · September 1, 2010, 8:40pm

“With the same expense, I can buy 4 GTX 480 and do multiple cards programming. In that way, would the performance even better than a single Tesla C2050?”

That won’t work. The GTX 480’s double precision is artificially limited to 1/4 of the Tesla’s.

I don’t have much against price discrimination, and this looks like the classic dilemma where buyers want to buy low and sellers want to sell high.

I was going to suggest checking whether NVIDIA has discounts for universities, but I don’t think it does, since all the people I know at school are using GeForce.

athlonshi · September 1, 2010, 8:51pm

I am just thinking about this in terms of cost, as I know NVIDIA artificially reduces the DP performance of the GTX 4XX. But anyway, the GTX 480 should be better than the 260, shouldn’t it?

Indeed, the Tesla C2050 is currently discounted for educational institutions and costs much less than its retail price. Otherwise, I would cry loudly at spending 2,500 bucks.

avidday · September 1, 2010, 8:56pm

I think you need to reset your expectations a bit. Things like matrix multiplication are memory bandwidth limited, not compute limited. Your reference GTX 260 has about 110 GB/s of global memory bandwidth. Your C2050 has about 140 GB/s. Leaving aside the architectural improvements (especially cache, dual-issue scheduling and other things which can improve IPC), that is only about a 1.27x improvement. For compute-limited codes and double precision, the “baseline” speedup can be a lot higher, but it doesn’t automatically follow that a Fermi card will be tremendously faster than a GT200 card for any arbitrary benchmark you might choose.
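
One common way to make this bandwidth argument concrete is the roofline model, which is not mentioned in the thread and is brought in here purely as an illustration: attainable throughput is the smaller of the compute peak and bandwidth × arithmetic intensity (flops per byte moved to or from global memory). The sketch assumes a C2050 peak of about 1030 GFLOPS (448 cores × 1.15 GHz × 2) and the roughly 140 GB/s quoted above; the intensity values are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


PEAK = 1030.0   # assumed C2050 single-precision peak, GFLOPS
BW = 140.0      # approximate C2050 global memory bandwidth, GB/s

# Arithmetic intensity at which compute, not bandwidth, becomes the limit.
ridge = PEAK / BW

# Hypothetical intensities, from streaming-like to heavily reused data.
for intensity in (0.5, 1.0, 4.0, 16.0):
    bound = attainable_gflops(PEAK, BW, intensity)
    limit = "bandwidth" if intensity < ridge else "compute"
    print(f"{intensity:5.1f} flops/byte -> {bound:7.1f} GFLOPS ({limit} bound)")
```

Under these assumptions a kernel needs more than about 7 flops per byte of global memory traffic before the extra cores, rather than the memory system, set the ceiling.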

athlonshi · September 1, 2010, 9:11pm

Yes, I agree with that.

According to the CUDA profiler, my code is compute limited, but I still didn’t see a significant speedup. Of course, there are other issues, such as code divergence.

Jimmy_Pettersson · September 1, 2010, 9:40pm

Agree with avidday.

C2050 @ 144 GB/s / GTX 260 @ 111.9 GB/s => 1.287x

The GTX 480 does around 177 GB/s which should provide you with a significant improvement. I’ve noticed that a majority of applications are bandwidth bound.
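
For a bandwidth-bound kernel, the expected speedup is just the ratio of memory bandwidths. A small sketch using the three figures from this post (the dictionary and function names are illustrative):

```python
# Global memory bandwidth in GB/s, as quoted in this post.
BANDWIDTH_GBS = {
    "GTX 260": 111.9,
    "Tesla C2050": 144.0,
    "GTX 480": 177.0,
}


def bandwidth_speedup(card, baseline="GTX 260"):
    """Expected speedup of a purely bandwidth-bound kernel."""
    return BANDWIDTH_GBS[card] / BANDWIDTH_GBS[baseline]


for card in ("Tesla C2050", "GTX 480"):
    print(f"{card}: {bandwidth_speedup(card):.2f}x over GTX 260")
```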

Lev · September 1, 2010, 10:46pm

I do not think GTX 480 double-precision performance was artificially limited. However, actual tests show that even on compute-bound double-precision codes the GTX is only about half as fast. Also note that the clock rate of the GTX is higher. So 4 GTX cards will be faster for sure if the program scales well across multiple GPUs. However, they will consume more power and occupy more slots.
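
The trade-off can be sketched as cards × per-card speed relative to the Tesla × multi-GPU scaling efficiency. The relative speeds below are the two figures given in this thread (1/4 and about 1/2); the 0.8 scaling efficiency is a made-up value for illustration, not a measurement.

```python
def multi_gpu_speedup(n_cards, relative_speed, scaling_efficiency=1.0):
    """Throughput of n cards relative to one reference card.

    relative_speed: speed of one card as a fraction of the reference.
    scaling_efficiency: 1.0 means perfect scaling across cards.
    """
    return n_cards * relative_speed * scaling_efficiency


# Double-precision speed of one GTX 480 relative to one C2050,
# according to the two claims made in the thread.
claims = {"1/4 of Tesla": 0.25, "about 1/2 of Tesla": 0.5}

for label, rel in claims.items():
    for eff in (1.0, 0.8):  # 0.8 is a hypothetical scaling efficiency
        total = multi_gpu_speedup(4, rel, eff)
        print(f"{label}, efficiency {eff}: 4 cards = {total:.2f}x one C2050")
```

With the 1/4 figure, four cards only break even on double precision even with perfect scaling; with the 1/2 figure they come out ahead as long as the code scales reasonably well.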

Marc_B · September 2, 2010, 6:36pm

The GTX 480 has a slightly higher clock speed than the Tesla, but half the memory. Buy the Tesla if you need to allocate big blocks of device memory. You can pin host memory to expand the memory pool, but it will always be faster to stay in device memory than to access host memory across the bus. Also keep the kernel run time limit in mind; the GTX 480 in my system has a limit (as reported by deviceQuery) but my Tesla doesn’t. Heavy scientific computing may also need the added reliability of ECC memory in the Tesla line. ECC slows things down, of course, but you can always disable it for more speed. The Tesla is more than fast enough for my needs, and the bigger device memory is more suitable for my application. It all comes down to your needs.
