I have been trying to get timing information using cutCreateTimer(). To get accurate results, I run the loop multiple times. For just 1 iteration I get the following timing information:
The first argument is the image size (24 in this case) and the second is the number of iterations. For instance, the last run takes 271 ms for 100 iterations, hence 2.71 ms per iteration. This differs from the other runs. Can anyone suggest why there is a discrepancy?
The loop involves device-host transfers and a kernel call.
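For reference, here is a minimal sketch of the kind of measurement I mean. It is not my actual code. The kernel (addOneKernel), the helper averageIterationMs, and the assumption of a square float image of imageSize x imageSize are all placeholders for illustration. It also uses CUDA events from the runtime API instead of the CUTIL timer, so that it compiles without the SDK's cutil library.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: adds 1 to every pixel.
__global__ void addOneKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Times `iterations` rounds of copy-in, kernel, copy-out and returns the
// average time per iteration in milliseconds.
// Error checking is omitted to keep the sketch short.
float averageIterationMs(int imageSize, int iterations)
{
    const int n = imageSize * imageSize;   // assumed square image
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 0.0f;

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    for (int it = 0; it < iterations; ++it) {
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        addOneKernel<<<blocks, threads>>>(d_data, n);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            // wait until all timed work is done

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);

    return totalMs / iterations;
}

int main(int argc, char **argv)
{
    // First argument: image size; second argument: number of iterations.
    int imageSize  = (argc > 1) ? atoi(argv[1]) : 24;
    int iterations = (argc > 2) ? atoi(argv[2]) : 1;
    if (imageSize < 1) imageSize = 24;
    if (iterations < 1) iterations = 1;

    float perIter = averageIterationMs(imageSize, iterations);
    printf("size %d, %d iterations: %f ms per iteration\n",
           imageSize, iterations, perIter);
    return 0;
}
```

Running it once with arguments 24 1 and once with 24 100 mirrors the two cases I am comparing: the total time is divided by the iteration count to get a per-iteration figure.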