Contradictory timing results for Bayer decoding when executed on a GPGPU using CUDA C

I am developing my own CUDA C programs for Bayer decoding (demosaicing) using gradient-based interpolation. I ran the program on an NVIDIA graphics card (Quadro FX 580, compute capability 1.1) and on a dedicated compute GPU (Tesla C1060). The program takes 0.08 ms on the graphics card but 133 ms on the Tesla, which is the opposite of what I would expect, since the Tesla is by far the more powerful device.
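For context, the launch and timing in my test harness follow roughly this pattern (a simplified sketch: the kernel name and body are placeholders, the real kernel does the gradient-based interpolation):

```
#include <cuda_runtime.h>
#include <stdio.h>

/* Placeholder kernel: the real one performs gradient-based Bayer interpolation. */
__global__ void bayerDecodeKernel(const short *raw, short *rgb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    /* placeholder work instead of the full interpolation */
    rgb[3 * (y * width + x)] = raw[y * width + x];
}

int main(void)
{
    const int width = 2048, height = 2048;
    short *d_raw, *d_rgb;
    cudaMalloc((void **)&d_raw, width * height * sizeof(short));
    cudaMalloc((void **)&d_rgb, 3 * width * height * sizeof(short));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    bayerDecodeKernel<<<grid, block>>>(d_raw, d_rgb, width, height);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* wait for the kernel to finish before reading the time */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms, last error: %s\n", ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_raw);
    cudaFree(d_rgb);
    return 0;
}
```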

Also, when executing solely on the graphics card, the time taken by the same program jumps from 0.08 ms to 400 ms just by varying the number of threads per block and the number of blocks per grid (see the sketch below for the kind of variation I mean).
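Continuing the sketch above, the variation is of this kind (illustrative block sizes only, not necessarily the exact ones I used; 2048 divides evenly by both):

```
/* Illustrative launch-configuration variants for the 2048x2048 image. */
dim3 block16(16, 16);                    /* 256 threads per block */
dim3 grid16(2048 / 16, 2048 / 16);       /* 128 x 128 blocks      */

dim3 block32(32, 16);                    /* 512 threads per block */
dim3 grid32(2048 / 32, 2048 / 16);       /*  64 x 128 blocks      */

/* same kernel, different configuration: */
bayerDecodeKernel<<<grid16, block16>>>(d_raw, d_rgb, 2048, 2048);
/* bayerDecodeKernel<<<grid32, block32>>>(d_raw, d_rgb, 2048, 2048); */
```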

I can’t figure out the reason for such drastic changes in results.

Image size for decoding: 2048 x 2048
Data type: short int