Computer vision algorithm: trying to find the best implementation

Hi there,

I am trying to implement my own blob detection algorithm in CUDA to get a high frame rate on large images,
and I eventually want to put it in an external object for Jitter, the real-time environment from Cycling '74.
For the moment, the image is stored in a uchar array on both the host and the device.
When I just transfer the 640x480 image from host to device and back with cudaMemcpy(), I reach 512 fps, which is the maximum in my runtime environment.
But when I invoke a __global__ function that doesn't do anything, the frame rate drops to 470 fps at a resolution of 320x240 and to 250 fps at 640x480,
whereas the CPU-only version stays at 512 fps at all resolutions…
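
To make the test concrete, here is a minimal sketch of the kind of round trip described above, timed with CUDA events. It is not the exact code from the external: the names emptyKernel and timeRoundTrip are placeholders, and the 16x16 block size is an assumption.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel that does no work, standing in for the empty
// __global__ function described above.
__global__ void emptyKernel(unsigned char *img, int width, int height) {}

// Times one host -> device -> (optional empty kernel) -> host round trip
// for a single-channel 8-bit image. Returns the elapsed time in
// milliseconds, or a negative value if a CUDA call failed.
float timeRoundTrip(unsigned char *hostImg, int width, int height, bool launchKernel)
{
    size_t bytes = (size_t)width * height;
    unsigned char *devImg = 0;
    if (cudaMalloc((void **)&devImg, bytes) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(devImg, hostImg, bytes, cudaMemcpyHostToDevice);
    if (launchKernel) {
        dim3 dimBlock(16, 16);  // assumed block size, not a tuned value
        dim3 dimGrid((width + 15) / 16, (height + 15) / 16);
        emptyKernel<<<dimGrid, dimBlock>>>(devImg, width, height);
    }
    cudaMemcpy(hostImg, devImg, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until everything before 'stop' is done

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devImg);
    return ms;
}
```

Calling timeRoundTrip once with launchKernel set to false and once with it set to true should separate the copy cost from the launch cost. The allocation is kept outside the timed region on purpose, and the very first CUDA call in a process also pays for context creation, so a warm-up call before measuring is probably a good idea.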

So I am wondering whether this is due to my graphics card (a GeForce 8600 GT with 512 MB of memory), whether I should use a cudaArray instead of a uchar array, or whether I can improve performance by choosing a better dimBlock / dimGrid, or something else…
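
For the dimBlock / dimGrid part of the question, this is a sketch of the launch configuration I have in mind, with one thread per pixel. The kernel invertKernel and the helpers divUp and launchInvert are hypothetical stand-ins for the real blob detection code, and 16x16 is only an assumed starting point, not a value I have tuned.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements with blocks of size blockSize.
inline int divUp(int n, int blockSize) { return (n + blockSize - 1) / blockSize; }

// Hypothetical stand-in for the real per-pixel work: inverts each pixel of a
// tightly packed, single-channel 8-bit image.
__global__ void invertKernel(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)  // guard the partial blocks at the image edges
        img[y * width + x] = 255 - img[y * width + x];
}

// Launches one thread per pixel on an image already resident on the device.
cudaError_t launchInvert(unsigned char *devImg, int width, int height)
{
    dim3 dimBlock(16, 16);  // 256 threads per block, an assumed starting point
    dim3 dimGrid(divUp(width, dimBlock.x), divUp(height, dimBlock.y));
    invertKernel<<<dimGrid, dimBlock>>>(devImg, width, height);
    return cudaGetLastError();  // reports launch configuration errors
}
```

With this layout a 640x480 image gives a 40x30 grid and a 320x240 image gives a 20x15 grid, so the bounds check only matters when the resolution is not a multiple of 16.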

My complete configuration is:
Quad Core 2.50 GHz
3 GB of RAM
GeForce 8600 GT with 512 MB, driver 182.08, and CUDA 2.1 toolkit/SDK
Windows XP Pro SP3

Best regards