Tesla C2050 slower than Tesla T10?

I just got the new Tesla C2050, and I was shocked to find that it runs my code slower than the old Tesla T10.

Q1:

When I run cudaGetDeviceProperties() on the Tesla T10, it reports 30 multiprocessors. On the C2050, however, it reports only 14. I don’t get it: the C2050 has about double the number of cores of the T10, right?

Q2:

The C2050 supports compute capability 2.0 (sm_20), but I cannot select it as the target because the VS 2008 compiler tells me it is not supported. Why is this happening?

Q3:
Are there any Fermi advantages I can take without having to change the code?

Fermi has 32 cores per multiprocessor. The T10 has 8. So the C2050 has 14 × 32 = 448 cores, against 30 × 8 = 240 on the T10.

Starting with the CUDA 3.0 toolkit, nvcc supports the sm_20 target (on the command line, `-arch=sm_20`). I don’t develop for Windows, so I can’t tell you what to change inside Visual Studio, but if you are using a recent toolkit and project template, there should be a way to set the architecture target correctly.

There is now a considerable amount of material available on this, starting with the compatibility and tuning guides in the CUDA toolkit and going through to the excellent performance optimization tutorials found here. Most people’s experience is that some tuning is needed to get optimal performance on Fermi. The usual starting points are experimenting with ECC and the cache/shared memory settings, then moving on to profiling and looking at execution parameters. But every code is different, and what works for one might not work for another. All I can suggest is to have a look at some of the available porting and tuning material to get a feel for what might be possible.
