I am timing my GPU application by recording the time on the CPU before and after a loop that runs 10 times and includes (a simplified sketch follows the list):
- memory allocation on the GPU
- data transfer to the GPU
- launching several kernels
- data transfer back to the CPU
- memory deallocation
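In outline, the timed loop looks roughly like this (a minimal sketch: the kernel, data type, and buffer size are placeholders, and the real code launches several kernels rather than one):

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for the real kernels; the actual application launches several.
__global__ void myKernel(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;              // placeholder data size
    std::vector<float> h(n, 1.0f);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 10; ++iter) {
        float* d = nullptr;
        cudaMalloc((void**)&d, n * sizeof(float));                            // allocation on the GPU
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);   // transfer to the GPU
        myKernel<<<(n + 255) / 256, 256>>>(d, n);                             // kernel launch(es)
        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);   // transfer back to the CPU
        cudaFree(d);                                                          // deallocation
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / 10.0;
    printf("average per iteration: %f ms\n", ms);
    return 0;
}
```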
The average time measured by the testing executable is ~20ms. However, when I use the Visual Profiler and sum up the numbers in the CPU time column, the result is ~11ms. For different amounts of data (and thus different computational load on the GPU) I get similar results (a ~9-15ms difference). Is there an explanation for this difference? Is there a CPU overhead that the profiler does not measure but that slows down the application (where could so much time be spent)?
Thank you