I am trying to implement a filter in CUDA and have written a kernel function for it. After the kernel executes, I have to copy 32400 float values from device to host, and this copy takes ~2000 milliseconds.
If I perform the same copy before launching the kernel, it takes only ~4 milliseconds.
I am confused by the difference between these two timings.
Can anybody help me solve this problem?
I think this might be due to some overhead. If so, is it possible that such a large overhead is a limitation of the graphics card?
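One way to check where the time actually goes is to time the kernel and the copy separately on the GPU timeline. Kernel launches are asynchronous with respect to the host, so a host-side timer wrapped around a blocking cudaMemcpy can also include the time spent waiting for the kernel to finish. The sketch below is only an illustration under assumptions: filterKernel is a hypothetical stand-in, not my real kernel, and the launch configuration and buffer names are made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real filter kernel (assumption, not the actual code).
__global__ void filterKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i];
}

int main() {
    const int n = 32400;                      // number of floats copied back
    const size_t bytes = n * sizeof(float);

    float* hOut = (float*)malloc(bytes);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dIn, 0, bytes);

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    // Events are recorded in the default stream, so they mark points
    // on the GPU timeline rather than on the host timeline.
    cudaEventRecord(start);
    filterKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(afterKernel);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);          // wait until everything has finished

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, copy: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(dIn);
    cudaFree(dOut);
    free(hOut);
    return 0;
}
```

If the copy interval comes out small and the kernel interval accounts for most of the ~2000 milliseconds, the delay would belong to the kernel itself rather than to the transfer.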
Thanks in advance. :)