Regarding repeatability of results

Hi all,

I am trying to get my feet wet with the latest CUDA release. I am following the CUDA article series on the Dr. Dobb's site. Based on the discussion there, I have created two versions: one normal reverse-array code and another using shared memory. I know that using shared memory won't make much difference for such a small problem, but my question is about the repeatability of results. I did 10 runs of each kernel to average the runtime, and here are the per-run timings I obtained on my Quadro FX 5800.
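For reference, the plain (global-memory) version I am timing is essentially the following sketch; the names reverseArray, d_in, and d_out are my own and may differ slightly from the article's exact listing:

```cuda
#include <cuda_runtime.h>

// Reverse an array of ints using only global memory: each thread
// reads one element and writes it to the mirrored position.
__global__ void reverseArray(int *d_out, const int *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        d_out[n - 1 - i] = d_in[i];
}
```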

Reverse Array Unoptimized (not using shared memory) runtime in msecs.

0.211290
0.097023
0.106995
0.106576
0.090240
0.086499
0.109024
0.241949
0.101910
0.090102

And here are the results for the optimized code using shared memory.

Reverse Array Optimized (using shared memory) runtime in msecs.

0.081168
0.200502
0.108006
0.073210
0.088573
0.085309
0.084102
0.232170
0.089037
0.085626

Now, as you can see, the results are not repeatable and vary across runs. My question is: in such cases, how do you report kernel timings in a research paper? What method do others use to deal with this variation across runs?

Fix the bugs first, then time the result.

Which bugs are you talking about? I am referring to the reverseArray shared-memory CUDA code given here

and the unoptimized code given here

I timed the results, and they differ on each run. Now you are saying the code has bugs; which bugs does it have? The author of that article did not mention anything about this. Would you mind telling me where the bugs are, since I am just a beginner?
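The shared-memory version I am timing looks roughly like this (a sketch from memory rather than the article's exact listing; it assumes the array length is a multiple of the block size and that the shared-memory size is passed at launch):

```cuda
#include <cuda_runtime.h>

// Reverse an array using shared memory: each block stages its chunk
// in reversed order in shared memory, then writes the chunk to the
// mirrored block position in the output array.
__global__ void reverseArrayShared(int *d_out, const int *d_in, int n)
{
    extern __shared__ int s_data[];

    int in = blockDim.x * blockIdx.x + threadIdx.x;

    // First element of the mirrored output block.
    int outBlockStart = n - blockDim.x * (blockIdx.x + 1);

    // Stage this block's elements in reversed order.
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    __syncthreads();

    // Write the reversed chunk to the mirrored block position.
    d_out[outBlockStart + threadIdx.x] = s_data[threadIdx.x];
}
```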

Oh sorry, I read your post as meaning that the results of your kernel are not repeatable (as opposed to the timing). To get stable execution times, increase the amount of work done per kernel invocation until the relative variation decreases to an acceptable level.
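One common way to apply this advice is to discard a warm-up launch and then time many back-to-back launches with CUDA events, reporting the mean (and often also the minimum or standard deviation). A minimal sketch, where the launch() wrapper standing in for the actual kernel launch is a hypothetical name:

```cuda
#include <cuda_runtime.h>

// Time `iters` back-to-back launches of a kernel and return the
// average time per launch in milliseconds. The caller supplies a
// wrapper that performs the actual kernel launch.
float timeKernelMs(void (*launch)(), int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                    // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        launch();                // timed launches
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / iters;
}
```

Averaging over many launches amortizes launch overhead and scheduler noise, which is exactly the variation visible in single-run timings like the ones above.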

I tried to profile the two codes using the Visual Profiler. In the original article (CUDA, Supercomputing for the Masses: Part 6 | Dr Dobb's), the author says the shared-memory version has 0 incoherent stores (gst_incoherent), but I cannot find any counter by that name in the current Visual Profiler. In fact, when I compare the two outputs, all of the loads/stores are exactly the same. See the two attachments, which are snapshots of the profiler output: the first is the unoptimized output and the second the optimized one.

The only difference is in two columns: 1) instructions and 2) instruction throughput (not shown in the snapshot). So what are incoherent loads/stores called in the current Visual Profiler?