I used a lot of CUDA timers to measure the time consumed by kernels and memcpys in my code, but the results are very confusing. For the same piece of code, the timer readings vary wildly from run to run.
I use the following code to create, start, stop, and destroy a CUDA timer (cutil API from the SDK):

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    // ... CUDA doing its job ...
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("%f ms\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));
OK, now I run the following code:
    for (i = 0; i < 10000; i++) {
        start timer1;
        copy data from host to device;
        stop and read timer1;

        start timer2;
        run CUDA kernel1;
        stop and read timer2;

        start timer3;
        run CUDA kernel3;
        stop and read timer3;
    }
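Written out as real CUDA code, the loop body looks roughly like this (kernel names, launch configurations, and buffer sizes are hypothetical placeholders, not my actual code):

```cuda
for (int i = 0; i < 10000; i++) {
    CUT_SAFE_CALL(cutStartTimer(timer1));
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    CUT_SAFE_CALL(cutStopTimer(timer1));

    CUT_SAFE_CALL(cutStartTimer(timer2));
    kernel1<<<grid, block>>>(d_data);  // launch returns to the host immediately
    CUT_SAFE_CALL(cutStopTimer(timer2));

    CUT_SAFE_CALL(cutStartTimer(timer3));
    kernel3<<<grid, block>>>(d_data);  // launch returns to the host immediately
    CUT_SAFE_CALL(cutStopTimer(timer3));
}
```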
OK, the results are very weird. The sum of the three timer readings is roughly constant, but the individual readings vary wildly from run to run. For example:

    t1 + t2 + t3 = 0.2 + 100 + 20 = 120 + 0.2 + 0.3 = 0.2 + 120 + 0.2 = ... ~= 120 ms
I have no idea what is going on there. I had assumed that kernel launches are blocking, and that cudaMemcpy is too. How can I measure time consumption in CUDA accurately?
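One approach I have seen suggested is to time with CUDA events, which are recorded in the GPU's command stream and measured on the device itself, instead of with a host-side timer. A minimal sketch, assuming a hypothetical kernel1 and device buffer d_data:

```cuda
cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // enqueue start marker in stream 0
kernel1<<<grid, block>>>(d_data);          // hypothetical kernel and arguments
cudaEventRecord(stop, 0);                  // enqueue stop marker after the kernel
cudaEventSynchronize(stop);                // block host until the stop event completes
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("kernel1: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are ordered within the stream, the elapsed time brackets only the work enqueued between them, regardless of when the host thread gets ahead of the GPU.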