Hello everybody,
I’m currently developing a CUDA application and testing it on two different machines: my laptop, equipped with an NVIDIA GeForce 9600 GT, and a machine using a Tesla C870 card.
As a reminder, the Tesla has many more cores and multiprocessors, and a much higher compute rate in terms of GFLOPS.
The sample programs provided by the SDK (clock, reduction, …) confirm this.
My program can be summarized as three kernels, all of them launched with blocks of dimension (16,16).
The result is not what I expected: the execution time is twice as long on the Tesla card (almost exactly twice).
The CUDA Profiler points out one feature worth attention: there are 0 divergent branches for all the kernels launched on the Tesla, whereas on the 9600 GT we can count from 600 to 7400 of them. The instruction throughput is also higher on the 9600 GT! The warp size is 32 on both cards.
My device code contains a number of branches.
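To illustrate (this is a minimal sketch, not my actual kernels), the branches look roughly like this, i.e. data-dependent conditions where threads of the same 32-thread warp can take different sides of the `if` and get serialized, which is what the profiler counts as a divergent branch:

```cuda
// Sketch only: a data-dependent branch that may diverge within a warp.
__global__ void branchy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Neighboring threads may disagree on this condition,
        // so the two paths are executed one after the other.
        if (in[i] > 0.0f)
            out[i] = in[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}
```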
Are these results coherent? Could this be due to the devices’ architectures?
I should also mention that I run Linux on a 64-bit processor; I’ve tried both the 32-bit and 64-bit versions of Linux, and it does not change the result.
Thanks in advance for your help!