While testing my application, I noticed that memory transfer speed in CUDA does not look good at all, especially with small buffers.
With a 640x480x32bpp image, the speed looks good: about 1600 fps with CUDA versus 1120 fps on the CPU. But with a smaller image such as 320x240x32bpp, CUDA only reaches a maximum of 3570 fps, which is far too low compared with pure C++ code (18,750 fps). Is there any way to improve the transfer speed in CUDA?
I also tried cudaMemcpy (device to device), and the speed is the same as with my kernel (about 3570 fps)!
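To make the frame rates quoted above easier to compare, here is a small C++ sketch that converts them into time per frame and effective throughput. It assumes 32bpp means 4 tightly packed bytes per pixel and that each frame is moved exactly once; the function names are made up for illustration and are not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Size of one frame in bytes (assumes tightly packed pixels, no row padding).
std::size_t frame_bytes(std::size_t width, std::size_t height,
                        std::size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Average time spent on one frame, in microseconds, for a given frame rate.
double frame_time_us(double fps) {
    assert(fps > 0.0);
    return 1.0e6 / fps;
}

// Effective throughput in MB/s (1 MB = 10^6 bytes), assuming one pass per frame.
double throughput_mb_per_s(std::size_t bytes, double fps) {
    return static_cast<double>(bytes) * fps / 1.0e6;
}

// Print both figures for one measurement; 4 bytes per pixel is assumed (32bpp).
void print_frame_stats(const char* label, std::size_t width, std::size_t height,
                       double fps) {
    const std::size_t bytes = frame_bytes(width, height, 4);
    std::printf("%-10s %zux%zu: %8.1f us/frame, %8.1f MB/s\n", label, width,
                height, frame_time_us(fps), throughput_mb_per_s(bytes, fps));
}
```

Calling, for example, `print_frame_stats("CUDA", 320, 240, 3570.0)` from `main` gives roughly 280 us per frame and about 1097 MB/s, while 640x480 at 1600 fps works out to 625 us and about 1966 MB/s. So four times the data costs only a little more than twice the time, which may point to a fixed per-call cost rather than raw bandwidth in the small-image case.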