I am trying to get my feet wet with the latest CUDA release, following the CUDA article series on the Dr. Dobb's site. Based on the discussion there, I have created two versions of an array-reversal kernel: a plain one and one that uses shared memory. I know that shared memory won't make much difference for such a small problem, but my question is about the repeatability of the timing results. I did 10 runs for each invocation and averaged the kernel runtime.
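For reference, the two kernels look roughly like this (a sketch modeled on the Dr. Dobb's reverseArray example; the names are mine, and the shared-memory version assumes the array length is an exact multiple of the block size):

```
// Plain version: each thread reads one element from global memory
// and writes it to the mirrored global position.
__global__ void reverseArray(int *d_out, const int *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[n - 1 - i] = d_in[i];
}

// Shared-memory version: each block reverses its chunk while staging
// it in shared memory, then writes the chunk to the mirrored block
// position. Launch with blockDim.x * sizeof(int) bytes of dynamic
// shared memory as the third launch-config parameter.
__global__ void reverseArrayShared(int *d_out, const int *d_in)
{
    extern __shared__ int s[];
    int in = blockIdx.x * blockDim.x + threadIdx.x;
    s[blockDim.x - 1 - threadIdx.x] = d_in[in];
    __syncthreads();
    int out = (gridDim.x - 1 - blockIdx.x) * blockDim.x + threadIdx.x;
    d_out[out] = s[threadIdx.x];
}
```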
Here are the runtimes (in ms) that I obtained on my Quadro FX 5800 for the unoptimized version (not using shared memory):

0.211290 0.097023 0.106995 0.106576 0.090240 0.086499 0.109024 0.241949 0.101910 0.090102
And here are the runtimes (in ms) for the optimized version using shared memory:
0.081168 0.200502 0.108006 0.073210 0.088573 0.085309 0.084102 0.232170 0.089037 0.085626
As you can see, the results are not repeatable and vary considerably across runs. My question: in such cases, how do you report kernel timings in a research paper? What methods do others use to deal with this run-to-run variation?
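In case the measurement method matters, this is the kind of per-invocation timing I mean, sketched with CUDA events (a minimal sketch; my actual harness may differ in details, and the allocation and setup of d_in, d_out, dimGrid, and dimBlock are omitted):

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time a single kernel launch.
cudaEventRecord(start, 0);
reverseArrayShared<<<dimGrid, dimBlock, dimBlock.x * sizeof(int)>>>(d_out, d_in);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```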