I downloaded the question set from the Documentation page (under trainings and tutorials). All other parts went fine, and my solution to Question 6 is correct (basically optimizing the reversal of an array by using shared memory).
However, the profiler shows no improvement for my program in either GPU time or the GLD_UNCOALESCED counter.
I compared the solution with my code and came down to the following point (s_data is the shared-memory array, d_in is the input array):
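For context, here is a sketch of what the shared-memory version of the exercise looks like. This is reconstructed from the idea of the SDK solution, not copied from it, so names and details are illustrative:

```cuda
// Sketch of a block-wise array reversal using shared memory.
// Assumption: d_out and d_in hold gridDim.x * blockDim.x ints,
// and the kernel is launched with blockDim.x * sizeof(int) of
// dynamic shared memory.
__global__ void reverseArrayBlock(int *d_out, const int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in       = inOffset + threadIdx.x;

    // Coalesced load: consecutive threads read consecutive
    // global addresses; the reversal happens in shared memory.
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    __syncthreads();

    // Blocks swap places: block 0 writes where the last block read.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out       = outOffset + threadIdx.x;

    // Coalesced store: consecutive threads write consecutive addresses.
    d_out[out] = s_data[threadIdx.x];
}
```

The naive version instead indexes global memory directly with something like `d_in[... - 1 - threadIdx.x]`, which is exactly the pattern discussed below.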
It looks like the subtraction is the culprit: indexing global memory with `-1 - threadIdx.x` breaks coalescing. blockDim.x should be a multiple of 16, and each threadIdx.x must then hit the matching address.
Rule:
threadIdx.x % 16 == globalAddress % 16
So, thread 0 reads address 0, or address 16, etc. Thread 1 reads address 1 or 17 or … etc.
When you subtract threadIdx.x instead, you break the rule: thread 1 ends up reading address 15 or 14, for example.
Does this help? It can be explained in more depth, but maybe this hint is enough.
Has anyone actually recorded consistent time differences between the two versions? I used the code from the solutions, yet there is no clear performance difference between the two. Also, the uncoalesced counters are reported as 0 for both versions; I'm not sure that can be normal for the initial version.
Are you running a GTX 260/280? The coalescing rules changed for that hardware generation, and the profiler currently reports 0 uncoalesced reads and writes on these cards.
Right: GT200-based GPUs do in hardware what the exercise does in software. See the latest Programming Guide section on gmem coalescing for compute capability 1.0 and 1.1 versus compute capability 1.2 and greater. The exercise will show a significant improvement on pre-GT200 hardware, but none on GT200.