Tesla C870 slower than GeForce 9600 GT?

Hello everybody,

I'm currently developing a CUDA application and testing it on two different machines: my laptop, equipped with an NVIDIA GeForce 9600 GT, and a machine using a Tesla C870 card.
As a reminder, the Tesla has many more cores and multiprocessors, and a much higher peak computation rate in GFLOPS.
The sample programs provided by the SDK (clock, reduction, …) confirm these specifications.
My program can be summarized as three kernels, all of them using blocks of dimension (16,16).

The result is not what I expected: the execution time is twice as long on the Tesla card (accurately twice).
The CUDA Profiler highlights one point worth attention: there are 0 divergent branches for all the kernels launched on the Tesla, whereas on the 9600 GT we can count from 600 to 7400 of them. The instruction throughput is also higher on the 9600 GT! The warp size is 32 on both cards.

My device code contains a number of branches.
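For context on what the profiler counts: a divergent branch occurs when threads of the same 32-thread warp take different paths of a conditional, forcing the warp to serialize both paths. A minimal sketch (a hypothetical kernel, not your actual code):

```cuda
// Hypothetical example of a branch that diverges within a warp:
// even- and odd-numbered threads of the same warp take different paths,
// so the hardware executes both paths serially (one divergent branch
// per warp in the profiler's counters).
__global__ void divergent_example(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // path taken by even threads
    else
        data[i] = data[i] + 1.0f;   // path taken by odd threads
}
```

By contrast, a condition that is uniform across a whole warp (e.g. branching on `blockIdx.x`) costs nothing extra, which is why the same source code can report different divergence counts depending on how the data and grid are laid out.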

Are these results coherent? Could this be due to the devices' architectures?

I should also mention that I run on Linux; my processor is 64-bit, and I've tried both the 32-bit and 64-bit versions of Linux, which does not change the result.

Thanks in advance for your help !

Look at the number of coalesced accesses.
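On compute capability 1.0/1.1 parts like these two cards, a half-warp only gets a single memory transaction when consecutive threads access consecutive, aligned addresses; anything else falls apart into one transaction per thread. A minimal sketch of the two patterns (hypothetical kernels, assuming a simple float array):

```cuda
// Coalesced: thread k of each half-warp reads in[base + k], so the 16
// accesses are served by one memory transaction on compute 1.0/1.1.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              // consecutive threads -> consecutive addresses
}

// Uncoalesced: with stride > 1 the threads of a half-warp hit scattered
// addresses, so the same read costs up to 16 separate transactions.
// (The caller must allocate at least n * stride input floats.)
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];     // strided pattern breaks coalescing
}
```

The profiler's `gld_coherent`/`gld_incoherent` counters (and their `gst` store equivalents) tell you which pattern your kernels actually produce on each card.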

Thanks for your answer.

I will check it, but what information is this value supposed to provide?
If the number of coalesced accesses is higher on the Tesla, what does that mean?

Thanks

If we assume that my code should focus more on coalesced access to global memory, why would the difference be so significant between the two cards?