CUDA Exercise 6 problems

I downloaded the question set from the Documentation page (under Trainings and Tutorials). All the other parts went fine, and my solution to Question 6 is correct (the task is basically to optimize reversing an array by using shared memory).

However, when I ran the profiler I found that my program showed no improvement in GPU time or in the GLD_UNCOALESCED value.

I compared the solution with my code and narrowed the difference down to the following line (s_data is the shared-memory array, d_in is the input array):

This is mine:

s_data[threadIdx.x] = d_in[blockIdx.x * blockDim.x + blockDim.x - 1 - threadIdx.x];

This is the solution:

s_data[blockDim.x - 1 - threadIdx.x] = d_in[blockDim.x * blockIdx.x + threadIdx.x];

Could someone tell me why there are no incoherent loads/stores in the latter case?

I have included my CU code in the attachment. Thanks guys in advance!
reverseArray_multiblock_fast.rar (2.33 KB)

It looks like it’s because of the subtractions: the - 1 - threadIdx.x part of the global index breaks coalescing. blockDim.x should be a multiple of 16, and then threadIdx.x must line up with the right address.

Rule:

threadIdx.x % 16 == globalAddress % 16

So, thread 0 reads address 0, or address 16, etc. Thread 1 reads address 1 or 17 or … etc.

When you subtract threadIdx.x, you break this rule, making thread 1 read an address like 15 or 14 instead.
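
To make the rule concrete, here is a small host-side sketch (my own illustration, not part of the exercise) that prints the global addresses touched by the first half-warp of block 0 under both indexing schemes, assuming blockDim.x = 256:

#include <cstdio>

int main()
{
    const int blockDim_x = 256;  // block size assumed here, as in the exercise
    const int blockIdx_x = 0;    // look at block 0 only

    for (int threadIdx_x = 0; threadIdx_x < 16; ++threadIdx_x) {
        // Reversed load (the first version): addresses run backwards through the block
        int reversedIdx = blockIdx_x * blockDim_x + blockDim_x - 1 - threadIdx_x;
        // Forward load (the solution): thread k touches address base + k
        int forwardIdx = blockDim_x * blockIdx_x + threadIdx_x;
        printf("thread %2d: reversed %3d (mod 16 = %2d), forward %3d (mod 16 = %2d)\n",
               threadIdx_x, reversedIdx, reversedIdx % 16,
               forwardIdx, forwardIdx % 16);
    }
    return 0;
}

The forward column satisfies threadIdx.x % 16 == address % 16 for every thread, so the half-warp's 16 loads coalesce into a single transaction; the reversed column violates it, so each load is serviced separately on compute 1.0/1.1 hardware.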

Does this help? It can be explained in more depth, but maybe this hint is enough.

Thanks for the fast response. I found the relevant passage by searching the programming guide.

But do I need to worry about this when writing to global memory as well?

Yes. Coalescing must be done on both reads and writes for optimal performance.
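
For example, here is a minimal sketch (my own, not taken from the exercise code) of a multiblock reverse without shared memory: the read side is coalesced, but the write pattern runs backwards, so it is the store that ends up uncoalesced on compute 1.0/1.1 hardware.

__global__ void reverseArrayNaive(int *d_out, int *d_in, int n)
{
    // Coalesced read: thread k of each block loads element base + k
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    if (in < n)
        d_out[n - 1 - in] = d_in[in];  // write addresses decrease across the half-warp: uncoalesced store
}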

Yes, you can have both uncoalesced reads & writes.

Has anyone actually recorded consistent time differences between the two versions? I used the code from the solutions, yet there’s no clear optimization difference between the two. Also, the uncoalesced values are reported as 0 for both versions… I'm not sure that should be normal for the initial version.

Are you running GTX 260/280? The coalescing rules for this hardware generation have been changed, and the profiler currently reports 0 uncoalesced reads & writes on these cards.

Yes, I’m using a GTX 260. Still, shouldn’t the shared memory reduce the execution time?!

Only if using shared memory reduces the number of reads and writes. I believe in this case shared memory is used to coalesce the reading/writing.
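
For reference, here is a sketch of how the solution kernel puts the two coalesced accesses together (the function name and details below are my reconstruction from the line quoted earlier, and the kernel is assumed to be launched with blockDim.x * sizeof(int) bytes of dynamic shared memory; the downloaded solution may differ slightly):

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    // Coalesced read: thread k of each block loads element base + k ...
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    // ... and stores it into shared memory in reversed order within the block
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Wait until the whole block has filled shared memory
    __syncthreads();

    // Coalesced write: the block's reversed contents go to the mirrored block position,
    // again with thread k writing element base + k
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

Both global accesses keep the thread k / address base + k pattern, so the read and the write coalesce; the reordering itself happens entirely in shared memory, which has no coalescing restriction.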

Right, GT200-based GPUs do in hardware what the exercise does in software. See the latest programming guide section on gmem coalescing for compute capability 1.0 and 1.1 versus compute capability 1.2 and greater. The exercise will show a significant improvement on pre-GT200 hardware, but none on GT200.

Paulius