I downloaded the question set from the Documentation page (under trainings and tutorials). All the other parts went fine, and my solution to Question 6 (optimizing the array reversal by using shared memory) produces the correct result.
However, when I ran it through the profiler, my program showed no improvement in GPU time or in the GLD_UNCOALESCED value.
I compared the reference solution with my code and narrowed the difference down to the following line (s_data is the shared memory array, d_in is the input array):
This is mine:
s_data[threadIdx.x] = d_in[blockIdx.x * blockDim.x + blockDim.x - 1 - threadIdx.x];
This is the solution:
s_data[blockDim.x - 1 - threadIdx.x] = d_in[blockDim.x * blockIdx.x + threadIdx.x];
Could someone tell me why the solution's version produces no incoherent (uncoalesced) loads/stores, while mine does?
I have included my .cu code in the attachment. Thanks in advance!
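To make the difference between the two lines concrete, here is a small host-side sketch in plain C++ that only reproduces the index arithmetic of each version; it does not launch a kernel. The names Access, mineAccess, solutionAccess and printFirstHalfWarp are hypothetical helpers I made up for illustration, and the block size passed in is an assumption, not necessarily what my attached code uses.

```cpp
#include <cstdio>

// Hypothetical helper type: which shared-memory slot a thread writes
// and which global-memory element it reads.
struct Access {
    int sharedIdx;
    int globalIdx;
};

// Index arithmetic copied from my version:
//   s_data[threadIdx.x] = d_in[blockIdx.x * blockDim.x + blockDim.x - 1 - threadIdx.x];
Access mineAccess(int threadIdx_x, int blockIdx_x, int blockDim_x) {
    return { threadIdx_x,
             blockIdx_x * blockDim_x + blockDim_x - 1 - threadIdx_x };
}

// Index arithmetic copied from the solution:
//   s_data[blockDim.x - 1 - threadIdx.x] = d_in[blockDim.x * blockIdx.x + threadIdx.x];
Access solutionAccess(int threadIdx_x, int blockIdx_x, int blockDim_x) {
    return { blockDim_x - 1 - threadIdx_x,
             blockDim_x * blockIdx_x + threadIdx_x };
}

// Print the global index read by each of the first 16 threads of a block
// (16 is an assumed group size chosen only to keep the output short).
void printFirstHalfWarp(int blockIdx_x, int blockDim_x) {
    for (int t = 0; t < 16; ++t) {
        Access m = mineAccess(t, blockIdx_x, blockDim_x);
        Access s = solutionAccess(t, blockIdx_x, blockDim_x);
        std::printf("thread %2d: mine reads d_in[%d], solution reads d_in[%d]\n",
                    t, m.globalIdx, s.globalIdx);
    }
}
```

Calling printFirstHalfWarp(0, 256) from a main function shows that both versions fill s_data with the same reversed contents, but in my version consecutive threads read d_in at descending addresses, while in the solution consecutive threads read ascending, consecutive addresses.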
reverseArray_multiblock_fast.rar (2.33 KB)