Disappointing performance using C2050

I expected the C2050 to be much faster than the GTX 260 for both single- and double-precision operations, but that does not seem to be the case.
With ECC off, my current kernel runs only 1.5 times faster on the C2050 than on the GTX 260. More surprisingly, the SDK matrixMul example reached only ~100 GFLOPS on the C2050. After increasing the array size in the SDK example, this went up to ~180 GFLOPS, but that is still far below the theoretical single-precision peak of about 1030 GFLOPS. Very disappointing performance from the new card. I am thinking it may not be a bad idea to go for four GTX 4XX cards instead of one Tesla. :(
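For reference, a minimal sketch of where a theoretical single-precision figure of that size comes from and how far the matrixMul result sits below it. It assumes 448 cores at a 1.15 GHz shader clock and counts one fused multiply-add per core per cycle as 2 flops; the function name is made up for illustration.

```python
def peak_sp_gflops(cores, shader_clock_ghz, flops_per_core_per_cycle=2):
    """Peak single-precision GFLOPS, counting one FMA as 2 flops (assumption)."""
    return cores * shader_clock_ghz * flops_per_core_per_cycle

# C2050 figures: 448 cores at a 1.15 GHz shader clock.
c2050_peak = peak_sp_gflops(448, 1.15)

# Best matrixMul result reported in the post above.
measured_gflops = 180.0
fraction_of_peak = measured_gflops / c2050_peak

print(f"C2050 theoretical peak: {c2050_peak:.1f} GFLOPS")
print(f"matrixMul reaches {fraction_of_peak:.1%} of that peak")
```

Peak figures like this assume every core issues an FMA on every cycle, which real kernels rarely sustain.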

I am also using a C2050 and expected more speedup, but for now a 1.6x speedup is about right, based on core count and clock:

1.15 GHz * 448 cores (C2050)
___________________________ = 1.83
1.3 GHz * 216 cores (GTX 260)
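The same ratio as a small script, in case you want to plug in other cards. This is only a back-of-the-envelope sketch: it treats cores times clock as a proxy for raw throughput and ignores every architectural difference between GT200 and Fermi.

```python
def raw_throughput(cores, clock_ghz):
    """Cores times shader clock, a crude proxy for compute throughput."""
    return cores * clock_ghz

gtx260 = raw_throughput(216, 1.3)   # figures quoted in this post
c2050 = raw_throughput(448, 1.15)

expected_speedup = c2050 / gtx260
print(f"Expected C2050 / GTX 260 speedup: {expected_speedup:.2f}x")
```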

Several of the SDK examples run slower on Fermi than on GT200 due to architectural changes (matrix multiply is probably one of them). Just give NVIDIA some time to tune and improve the code. For example, a University of Virginia student and his professor recently engineered an incredibly fast GPU radix sort at 1 G int32s / second, 4x faster than the CUDA SDK radix sort.

Please be more patient. You probably don’t know how difficult parallel programming can be :)

Understood, and I am not completely frustrated about it. :rolleyes:

I also did the math, but unlike compute capability 1.3 devices, Fermi should perform much better in double precision, since a double-precision instruction takes 2 clock cycles, compared to 32 on compute capability 1.3.
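To put rough numbers on that, here is a sketch that turns the cycles-per-instruction figures into peak double-precision GFLOPS. It reads the 2 and 32 cycles as the time one multiprocessor needs to issue a double-precision instruction for a full 32-thread warp, and it assumes 27 multiprocessors on the 216-core GTX 260 and 14 on the C2050. Those counts and the 2-flops-per-FMA convention are my assumptions, not something stated in the thread.

```python
WARP_SIZE = 32

def peak_dp_gflops(sm_count, clock_ghz, cycles_per_warp_instr, flops_per_thread=2):
    """Peak DP GFLOPS if every SM issues one DP FMA per warp back to back."""
    # Each SM finishes WARP_SIZE thread-instructions every cycles_per_warp_instr cycles.
    flops_per_sm_per_cycle = WARP_SIZE / cycles_per_warp_instr * flops_per_thread
    return sm_count * clock_ghz * flops_per_sm_per_cycle

# Assumed SM counts: 27 x 8 cores (GTX 260, 216 cores), 14 x 32 cores (C2050).
gtx260_dp = peak_dp_gflops(sm_count=27, clock_ghz=1.3, cycles_per_warp_instr=32)
c2050_dp = peak_dp_gflops(sm_count=14, clock_ghz=1.15, cycles_per_warp_instr=2)
dp_ratio = c2050_dp / gtx260_dp

print(f"GTX 260 peak DP: {gtx260_dp:.1f} GFLOPS")
print(f"C2050   peak DP: {c2050_dp:.1f} GFLOPS")
print(f"Ratio: {dp_ratio:.1f}x")
```

Under these assumptions the double-precision gap is far larger than the single-precision one, but only a kernel dominated by double-precision arithmetic would see it.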

If the math is about right, I would definitely go for the GTX 480, which costs less than 500 bucks but has 480 cores at 1.4 GHz. With the same expense, I can buy 4 GTX 480s and do multi-card programming. In that way, would the performance be even better than a single Tesla C2050?

So my question is how to take full advantage of the Tesla card.

“With the same expense, I can buy 4 GTX 480s and do multi-card programming. In that way, would the performance be even better than a single Tesla C2050?”

That won’t work. The GTX 480’s double precision is artificially limited to 1/4 of the Tesla’s.

I don’t have much against price discrimination, and this looks like the classic dilemma: buyers want to buy low, sellers want to sell high.

I was going to suggest checking whether NVIDIA offers discounts for universities, but I don’t think so, since everyone I know at school is using GeForce.

I am just thinking about this in terms of cost, since I know NVIDIA artificially reduces the DP performance of the GTX 4XX. But the GTX 480 should still be better than the 260, shouldn’t it?

Indeed, the Tesla C2050 is now on sale for educational institutions at much less than its retail price. Otherwise, I would cry loudly about spending 2,500 bucks :haha:

I think you need to reset your expectations a bit. Things like matrix multiplication are memory bandwidth limited, not compute limited. Your reference GTX 260 has about 110 GB/s of global memory bandwidth. Your C2050 has about 140 GB/s. Leaving aside the architectural improvements (especially cache, dual-issue scheduling and other features that can improve IPC), that is only about a 1.27x improvement. For compute-limited codes and double precision, the “baseline” speedup can be a lot higher, but it doesn’t automatically follow that a Fermi card will be tremendously faster than a GT200 card for any arbitrary benchmark you might choose.
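A simple roofline-style estimate illustrates the point. This is a sketch under stated assumptions: attainable throughput is taken as the smaller of peak compute and bandwidth times arithmetic intensity, the peak figures are cores x clock x 2 flops, and the intensity values are hypothetical rather than measured from any kernel.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# name: (peak SP GFLOPS, global memory bandwidth in GB/s)
CARDS = {
    "GTX 260": (216 * 1.3 * 2, 110.0),
    "C2050": (448 * 1.15 * 2, 140.0),
}

def fermi_speedup(flops_per_byte):
    """Estimated C2050 / GTX 260 speedup at a given arithmetic intensity."""
    old = attainable_gflops(*CARDS["GTX 260"], flops_per_byte)
    new = attainable_gflops(*CARDS["C2050"], flops_per_byte)
    return new / old

# Hypothetical intensities, from streaming-like to heavily compute bound.
for intensity in (0.25, 1.0, 4.0, 16.0):
    print(f"{intensity:5.2f} flops/byte -> {fermi_speedup(intensity):.2f}x")
```

At low intensity the estimate stays at the bandwidth ratio, and it only approaches the cores-times-clock ratio once the kernel does many flops per byte fetched.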

Yes, I agree with that.

My code is compute limited according to the profile produced by the CUDA profiler, but I still didn’t see a significant speedup. Of course, there are other issues, such as divergent branches.

Agree with avviday

C2050 @ 144 GB/s / GTX260 @ 111.9 GB/s => 1.287x

The GTX 480 does around 177 GB/s, which should give you a significant improvement. I’ve noticed that the majority of applications are bandwidth bound.

I do not think the GTX 480’s double-precision performance was artificially limited. In actual tests, even compute-bound double-precision codes run only about two times slower on the GTX. Also note that the clock rate of the GTX is higher. So 4 GTX cards will be faster for sure if the program scales well across multiple GPUs. However, they will consume more power and occupy more slots.
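A rough way to frame the four-card question, as a sketch only. It takes the pessimistic case from earlier in the thread, where the GTX 480 runs double precision at 1/4 of the Tesla’s per-clock rate, which I model here as 1/8 of the single-precision rate against 1/2 on the Tesla. Those rates, and the 2-flops-per-FMA convention, are assumptions.

```python
def peak_dp_gflops(cores, clock_ghz, dp_rate):
    """Peak DP GFLOPS: SP peak (cores x clock x 2 flops) scaled by the DP rate."""
    return cores * clock_ghz * 2 * dp_rate

# Assumed DP rates: 1/8 of SP on the GTX 480, 1/2 of SP on the C2050.
gtx480_dp = peak_dp_gflops(480, 1.4, 1 / 8)
c2050_dp = peak_dp_gflops(448, 1.15, 1 / 2)

# Multi-GPU scaling efficiency at which four GTX 480s match one C2050.
n_cards = 4
break_even_efficiency = c2050_dp / (n_cards * gtx480_dp)

print(f"GTX 480 peak DP: {gtx480_dp:.1f} GFLOPS")
print(f"C2050   peak DP: {c2050_dp:.1f} GFLOPS")
print(f"4 x GTX 480 break even at {break_even_efficiency:.0%} scaling efficiency")
```

So even with the cap, four cards could come out ahead on paper, but only if the code keeps scaling efficiency above the break-even figure, which depends entirely on how much the GPUs have to communicate.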

The GTX 480 has a slightly higher clock speed than the Tesla, but half the memory. Buy the Tesla if you need to allocate big blocks of device memory. You can pin host memory to expand the memory pool, but it will always be faster to stay in device memory than to access data across the bus. Also keep the kernel run time limit in mind; the GTX 480 in my system has a limit (as reported by deviceQuery) but my Tesla doesn’t. Heavy scientific computing may also need the data integrity that ECC memory provides in the Tesla line. ECC slows things down, of course, but you can always disable it for more speed. The Tesla is more than fast enough for my needs, and the bigger device memory is more suitable for my application. It all comes down to your needs.
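To make the memory argument concrete, here is a small sketch that estimates the largest square double-precision matrix multiply (three N x N matrices resident at once) that fits on each card. The capacities (1536 MiB for the GTX 480, 3072 MiB for the C2050) and the 12.5% of memory given up when ECC is enabled are my assumptions, and the estimate ignores whatever the driver and runtime reserve for themselves.

```python
import math

BYTES_PER_DOUBLE = 8
MIB = 2 ** 20

def max_square_dim(device_mib, matrices=3):
    """Largest N such that `matrices` N x N double arrays fit in device memory."""
    elements_per_matrix = device_mib * MIB // (matrices * BYTES_PER_DOUBLE)
    return math.isqrt(elements_per_matrix)

gtx480_n = max_square_dim(1536)                 # assumed 1.5 GB card
c2050_n = max_square_dim(3072)                  # assumed 3 GB card, ECC off
c2050_ecc_n = max_square_dim(3072 * 7 // 8)     # assumed 12.5% lost to ECC

print(f"GTX 480:         N <= {gtx480_n}")
print(f"C2050 (ECC off): N <= {c2050_n}")
print(f"C2050 (ECC on):  N <= {c2050_ecc_n}")
```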