Device-to-device memory transfer is slower than the CPU for small buffers!

Hi everyone,

  • While testing my application, I realized that the memory transfer speed in CUDA does not seem good at all (especially with small buffers).

  • With a 640x480x32bpp image, the speed looks good: about 1600 fps (CUDA) versus 1120 fps (CPU). But with a smaller image such as 320x240x32bpp, CUDA only reaches a maximum of 3570 fps, which is far too low compared with pure C++ code (18,750 fps). Is there any way to improve the transfer speed in CUDA?

  • I also tried cudaMemcpy (device to device); the speed is the same as with my kernel (about 3570 fps)!