GPU is fine but...

I just finished a kernel that does 5x7 2D filtering on a 1366x768 image and then converts the colorspace from YUV to RGB, all in less than 7 ms. That's great, since I want to see the results in real time at 50 Hz, which leaves 20 ms per frame. And that is exactly where the problem lies.
Since the CPU cannot work independently of the GPU, image acquisition becomes a bigger problem than the computational complexity of the algorithm, even though the hardware itself is able to process everything in real time. One memcpy of that frame costs 4 ms, and on top of that comes the cost of getting the image into the host's memory in the first place, which makes a real-time implementation nearly impossible.
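To put rough numbers on this, here is a small back-of-the-envelope sketch of the per-frame budget when everything runs one step after another. It uses only the figures quoted above (50 Hz, 7 ms as an upper bound for the kernel, 4 ms per memcpy). The function name and the idea of counting a second copy (reading the result back to the host) are my own assumptions for illustration, not something I measured.

```python
# Per-frame time budget when acquisition, copy and kernel run serially.
FRAME_PERIOD_MS = 1000.0 / 50  # 50 Hz display -> 20 ms per frame
KERNEL_MS = 7.0                # filter + YUV->RGB kernel (upper bound)
MEMCPY_MS = 4.0                # one host<->device copy of a full frame


def remaining_budget_ms(num_copies):
    """Time left per frame for acquisition and everything else.

    num_copies is hypothetical: 1 if only the upload is needed,
    2 if the result must also be copied back to the host.
    """
    return FRAME_PERIOD_MS - KERNEL_MS - num_copies * MEMCPY_MS


print(remaining_budget_ms(1))  # 9.0
print(remaining_budget_ms(2))  # 5.0
```

So with a single upload there are about 9 ms left to acquire the frame and place it in host memory, and only about 5 ms if the result also has to come back. The CPU is blocked the whole time, so none of that work can be hidden behind the kernel.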
This could be solved by DMA-ing directly into GPU memory and, at the very least, by freeing the CPU while the GPU is working. That is why I am really curious when these types of features will be enabled by the CUDA team.
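A rough way to see why overlap would matter: if acquisition, transfer and kernel could each work on a different frame at the same time, the frame period would be set by the slowest stage instead of by the sum of all stages. The sketch below is an idealized model under that assumption. The 10 ms acquisition time is a made-up placeholder, and the function names are hypothetical.

```python
# Idealized comparison: serial execution vs. fully overlapped stages.

def serial_period_ms(acquire_ms, copy_ms, kernel_ms):
    """Every stage waits for the previous one, so the times add up."""
    return acquire_ms + copy_ms + kernel_ms


def overlapped_period_ms(acquire_ms, copy_ms, kernel_ms):
    """Stages run concurrently on different frames (assumes perfect
    overlap with no contention), so the slowest stage sets the pace."""
    return max(acquire_ms, copy_ms, kernel_ms)


ACQUIRE_MS = 10.0  # hypothetical placeholder, not a measured value
COPY_MS = 4.0      # one memcpy of the frame, as quoted above
KERNEL_MS = 7.0    # kernel time, as quoted above (upper bound)

print(serial_period_ms(ACQUIRE_MS, COPY_MS, KERNEL_MS))      # 21.0
print(overlapped_period_ms(ACQUIRE_MS, COPY_MS, KERNEL_MS))  # 10.0
```

With that placeholder the serial case misses the 20 ms deadline (21 ms), while the overlapped case would fit with plenty of room. Real overlap would never be perfect, so treat this as an upper bound on what there is to gain.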