Hi Anshu,

It’s difficult to analyze your program, because it has no bounds checks and you don’t give the grid dimensions, nor M and N. This matters: if M and N are not multiples of 16, “sum” will be accessed out of bounds. For the sake of analysis, I assume you meant to make the dimensions of “shar_sum” a multiple of your block size, not 36. To simplify further, I made N, M, and the sides of shar_sum all 32 instead of 36, so the 16-by-16 block fits evenly into “sum”, with grid = (2,2,1) and block = (16,16,1). I took your code and added printf calls to see how it accesses “sum” as a function of threadIdx.x with threadIdx.y held constant. Here is the code and partial output for the first iteration of j in the outer loop:

```
__global__ void myKernel(int *sum, int W, int H)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    __shared__ int shar_sum[32 * 32];

    // each thread loads up to three rows of the 32x32 matrix
    for (int j = 3 * threadIdx.y; j < 3 * (threadIdx.y + 1) && j < 32; j++)
    {
        for (int i = threadIdx.x; i < 32; i += 16)
        {
            printf("threadIdx.x, y = %d, %d, tx, ty = %d, %d, i = %d, j = %d, shar_sum[%d], sum[%d]\n",
                   threadIdx.x, threadIdx.y, tx, ty, i, j,
                   i + (j * 32),
                   tx + (i - threadIdx.x) + (ty + (j - threadIdx.y)) * W);
            shar_sum[i + (j * 32)] = sum[tx + (i - threadIdx.x) + (ty + (j - threadIdx.y)) * W];
        } // i
    } // j
    __syncthreads();
    // ........ Processing
}
Output:
threadIdx.x, y = 0, 0, tx, ty = 0, 0, i = 0, j = 0, shar_sum[0], sum[0]
threadIdx.x, y = 1, 0, tx, ty = 1, 0, i = 1, j = 0, shar_sum[1], sum[1]
threadIdx.x, y = 2, 0, tx, ty = 2, 0, i = 2, j = 0, shar_sum[2], sum[2]
threadIdx.x, y = 3, 0, tx, ty = 3, 0, i = 3, j = 0, shar_sum[3], sum[3]
threadIdx.x, y = 4, 0, tx, ty = 4, 0, i = 4, j = 0, shar_sum[4], sum[4]
threadIdx.x, y = 5, 0, tx, ty = 5, 0, i = 5, j = 0, shar_sum[5], sum[5]
threadIdx.x, y = 6, 0, tx, ty = 6, 0, i = 6, j = 0, shar_sum[6], sum[6]
threadIdx.x, y = 7, 0, tx, ty = 7, 0, i = 7, j = 0, shar_sum[7], sum[7]
threadIdx.x, y = 8, 0, tx, ty = 8, 0, i = 8, j = 0, shar_sum[8], sum[8]
threadIdx.x, y = 9, 0, tx, ty = 9, 0, i = 9, j = 0, shar_sum[9], sum[9]
threadIdx.x, y = 10, 0, tx, ty = 10, 0, i = 10, j = 0, shar_sum[10], sum[10]
threadIdx.x, y = 11, 0, tx, ty = 11, 0, i = 11, j = 0, shar_sum[11], sum[11]
threadIdx.x, y = 12, 0, tx, ty = 12, 0, i = 12, j = 0, shar_sum[12], sum[12]
threadIdx.x, y = 13, 0, tx, ty = 13, 0, i = 13, j = 0, shar_sum[13], sum[13]
threadIdx.x, y = 14, 0, tx, ty = 14, 0, i = 14, j = 0, shar_sum[14], sum[14]
threadIdx.x, y = 15, 0, tx, ty = 15, 0, i = 15, j = 0, shar_sum[15], sum[15]
(next iteration of i)
threadIdx.x, y = 0, 0, tx, ty = 0, 0, i = 16, j = 0, shar_sum[16], sum[16]
threadIdx.x, y = 1, 0, tx, ty = 1, 0, i = 17, j = 0, shar_sum[17], sum[17]
threadIdx.x, y = 2, 0, tx, ty = 2, 0, i = 18, j = 0, shar_sum[18], sum[18]
threadIdx.x, y = 3, 0, tx, ty = 3, 0, i = 19, j = 0, shar_sum[19], sum[19]
threadIdx.x, y = 4, 0, tx, ty = 4, 0, i = 20, j = 0, shar_sum[20], sum[20]
threadIdx.x, y = 5, 0, tx, ty = 5, 0, i = 21, j = 0, shar_sum[21], sum[21]
threadIdx.x, y = 6, 0, tx, ty = 6, 0, i = 22, j = 0, shar_sum[22], sum[22]
threadIdx.x, y = 7, 0, tx, ty = 7, 0, i = 23, j = 0, shar_sum[23], sum[23]
threadIdx.x, y = 8, 0, tx, ty = 8, 0, i = 24, j = 0, shar_sum[24], sum[24]
threadIdx.x, y = 9, 0, tx, ty = 9, 0, i = 25, j = 0, shar_sum[25], sum[25]
threadIdx.x, y = 10, 0, tx, ty = 10, 0, i = 26, j = 0, shar_sum[26], sum[26]
threadIdx.x, y = 11, 0, tx, ty = 11, 0, i = 27, j = 0, shar_sum[27], sum[27]
threadIdx.x, y = 12, 0, tx, ty = 12, 0, i = 28, j = 0, shar_sum[28], sum[28]
threadIdx.x, y = 13, 0, tx, ty = 13, 0, i = 29, j = 0, shar_sum[29], sum[29]
threadIdx.x, y = 14, 0, tx, ty = 14, 0, i = 30, j = 0, shar_sum[30], sum[30]
threadIdx.x, y = 15, 0, tx, ty = 15, 0, i = 31, j = 0, shar_sum[31], sum[31]
```

Now, let’s assume that one half warp maps to threadIdx.x = 0 … 15 with threadIdx.y = 0. (I would have to check this, but it seems reasonable.) From the output, “sum” is accessed sequentially and contiguously: consecutive threads read consecutive 4-byte words. That should coalesce the global memory accesses. Whether this still holds in your original code, with the real M and N, I don’t know. As for changing this to work on 8-byte quantities instead of 4, I can’t say; the coalescing rules for 64-bit accesses use a larger segment, so you would have to re-check the alignment.

For these things, I usually use printf to check the program’s behavior. But since you run on a 1.2 device, in-kernel printf isn’t available to you. In that case, try running under a debugger, or in device emulation mode, where printf does work. Another option is the CUDA profiler, which should show you the number of coalesced accesses; but I’ve had trouble getting it to report that number correctly, so I don’t use it often.

Ken