You probably want to use a higher-resolution timer. In addition, you might want to run the memory copy several times and report the median, mean, minimum, or maximum rather than a single measurement. A higher-resolution timer for Windows is described here, for example: c++ - How to use QueryPerformanceCounter? - Stack Overflow
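To make the "run it several times and summarize" idea concrete, here is a rough sketch. It uses the portable `std::chrono::steady_clock` rather than QueryPerformanceCounter, and a plain host-side `std::memcpy` as a stand-in for whatever copy you are actually measuring. The names `TimingStats`, `summarize`, and `time_copy` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <numeric>
#include <utility>
#include <vector>

// Summary statistics over repeated timings, in milliseconds.
// (Hypothetical helper type for this sketch.)
struct TimingStats {
    double min_ms;
    double max_ms;
    double mean_ms;
    double median_ms;
};

// Reduce a non-empty set of samples to min, max, mean, and median.
TimingStats summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0);
    // For an even count, the median is the average of the two middle values.
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return {samples.front(), samples.back(),
            sum / static_cast<double>(n), median};
}

// Time `runs` copies of `bytes` bytes with a monotonic clock.
// The memcpy is only a placeholder for the copy under test.
TimingStats time_copy(std::size_t bytes, int runs) {
    assert(bytes > 0 && runs > 0);
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        const auto stop = std::chrono::steady_clock::now();
        // Read the destination so the copy is less likely to be optimized out.
        volatile char sink = dst[bytes - 1];
        (void)sink;
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return summarize(std::move(samples));
}
```

Calling something like `time_copy(64u << 20, 10)` would give you the statistics for ten 64 MiB copies. The first run is often slower than the rest (cold caches, page faults), which is one reason the median or minimum tends to be more stable than a single measurement.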
Thanks, this helps me a lot. Is there a way to display data directly on the graphics card instead of sending it back to the CPU and using bitmap.anim? I'm using CUDA 4.0, and it says it can send data directly to the displaying GPU, but I cannot find the method. Thank you.