I am trying to get my feet wet with the latest CUDA release. I am following the CUDA article series on the Dr. Dobb's site. Based on the discussion there, I have created two versions: a plain reverse-array kernel and one that uses shared memory. I know that using shared memory won't make much difference for such a small problem, but my question is about the repeatability of the results. I did 10 runs of each kernel and averaged the kernel runtime; here are the per-run timings I obtained on my Quadro FX 5800.
Reverse Array Unoptimized (not using shared memory) runtime in msecs.
0.211290
0.097023
0.106995
0.106576
0.090240
0.086499
0.109024
0.241949
0.101910
0.090102
And here are the results for the optimized code using shared memory.
Now, as you can see, the results are not repeatable and vary across runs. My question is: in such cases, how do you report the timings of your kernels in a research paper? What method do others use to account for this variation across runs?
Which bugs are you talking about? I am talking about the reverseArray shared-memory CUDA code given here
and the unoptimized code given here
I timed the results and they are different on each run. Now you are saying that it has bugs; which bugs does this code have? The author of that article did not mention anything about this. Would you mind telling me where the bugs are, since I am just a beginner?
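For reference, the shared-memory version from that Part 6 article looks roughly like this (reconstructed from memory, so treat it as a sketch rather than the author's exact listing):

```cuda
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    // Each thread loads one element, writing it into shared memory
    // at the mirrored position within the block.
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Wait until the whole block has been staged in shared memory.
    __syncthreads();

    // Write the block back out to the mirrored block position, so
    // global-memory stores stay contiguous (coalesced).
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];
}
```

The point of the shared-memory staging is that both the global load and the global store are contiguous per block; the reversal happens entirely in shared memory.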
Oh sorry, I read your post as meaning that the results of your kernel are not repeatable (as opposed to the timing). To get stable execution times, increase the amount of work done per kernel invocation until the relative variation decreases to an acceptable level.
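For example, one common pattern is to launch the kernel many times between a pair of CUDA events and divide the elapsed time by the iteration count; something along these lines (`nIter`, the kernel name, and the launch configuration are placeholders for your own values):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const int nIter = 1000;  // enough repetitions to smooth out launch jitter
cudaEventRecord(start, 0);
for (int i = 0; i < nIter; ++i)
    reverseArrayBlock<<<numBlocks, numThreads, sharedMemSize>>>(d_out, d_in);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for all launches to finish

float msTotal = 0.0f;
cudaEventElapsedTime(&msTotal, start, stop);
printf("average kernel time: %f ms\n", msTotal / nIter);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Averaging over many launches amortizes the per-launch overhead, which dominates for a kernel this small.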
I tried to profile the two codes using the Visual Profiler. In the original article (CUDA, Supercomputing for the Masses: Part 6 | Dr. Dobb's), the author says the shared-memory version has 0 incoherent stores (gst_incoherent), but I cannot find any counter by that name in the current Visual Profiler. In fact, when I compare the two outputs, all of the load/store counters are exactly the same. See the two attachments, which are snapshots of the profiler output: the first is the unoptimized version and the second is the optimized one.
The only difference is in two columns: 1) instructions and 2) instruction throughput (not shown in the snapshot). So what are incoherent loads/stores called in the current Visual Profiler?