partial coalescence in one warp

I’ve been reading about coalescence in the book “Programming Massively Parallel Processors”. As I understand coalescence across multiple threads, it only happens when all threads in the same warp access consecutive global memory locations. However, further on the book shows a tiled matrix multiplication kernel that is presented as an example where coalescence applies, yet when the width of the matrix exceeds the tile size, the addresses that are actually read consecutively are split into N groups. This means that every (warp size)/N threads, the address jump between two consecutive threads is larger than one. A small sketch of the resulting addresses follows the kernel below.

Here is the code for reference:

#define TILE_WIDTH 16   // tile size; the book's example uses 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m)
    {
        // Collaborative loading of the Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];

        // Make sure all threads are done with the tiles before they are overwritten
        __syncthreads();
    }

    Pd[Row*Width + Col] = Pvalue;
}
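
To make the access pattern concrete, here is a small host-side sketch (not from the book; it assumes TILE_WIDTH = 16, a 32-thread warp and an illustrative Width of 64) that prints the Md indices the first warp of block (0,0) would load in iteration m = 0:

#include <stdio.h>

#define TILE_WIDTH 16
#define WARP_SIZE  32

int main(void)
{
    const int Width = 64;   // assumed matrix width, larger than TILE_WIDTH
    const int m = 0;        // first tile iteration, block (0,0)

    for (int lane = 0; lane < WARP_SIZE; ++lane)
    {
        // Threads are numbered in row-major order inside the block,
        // so one 32-thread warp covers two rows of a 16x16 tile.
        int tx  = lane % TILE_WIDTH;
        int ty  = lane / TILE_WIDTH;
        int Row = ty;       // by = 0
        int idx = Row * Width + (m * TILE_WIDTH + tx);
        printf("lane %2d -> Md[%d]\n", lane, idx);
    }
    return 0;
}

With these numbers the warp touches Md[0..15] and Md[64..79]: two runs of 16 consecutive floats that are Width elements apart, which is exactly the “divided into N groups” pattern I describe above (here N = 2).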

Now my question: When the read addresses in one warp are only partially consecutive, does this mean that partial coalescence occurs?

If not, then either the example is wrong or I haven’t truly understood how coalescence applies to grid blocks.

I admit I didn’t read the code carefully, but there are two answers to your question:

For devices of compute capability 1.1: No.

For devices of compute capability 1.2 and higher: Yes.

For more details, check the Programming Guide, Appendix G.3.2 (page 151 in the 3.1 version).
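
As a rough illustration of the compute capability 1.2+ behaviour, here is a simplified host-side model (my own sketch, not the exact algorithm from Appendix G.3.2): it just counts the distinct 128-byte segments touched by the 4-byte loads of one half-warp, which approximates the number of memory transactions issued. For the Md pattern discussed above, each run of 16 consecutive floats stays inside one segment (assuming Md is 128-byte aligned and Width*4 is a multiple of 128), so each half-warp still needs only a single transaction:

#include <stdio.h>

#define HALF_WARP    16
#define SEGMENT_SIZE 128   // segment size assumed for 4-byte words

// Count how many distinct 128-byte segments the addresses of one
// half-warp fall into (simplified transaction model).
static int count_segments(const unsigned long long addr[HALF_WARP])
{
    unsigned long long seen[HALF_WARP];
    int segments = 0;
    for (int i = 0; i < HALF_WARP; ++i)
    {
        unsigned long long seg = addr[i] / SEGMENT_SIZE;
        int is_new = 1;
        for (int j = 0; j < segments; ++j)
            if (seen[j] == seg) { is_new = 0; break; }
        if (is_new) seen[segments++] = seg;
    }
    return segments;
}

int main(void)
{
    const int Width = 64;              // same illustrative width as above
    const unsigned long long base = 0; // assume Md starts 128-byte aligned
    unsigned long long addr[HALF_WARP];

    // Half-warp 0 (ty = 0): Md[0..15]
    for (int tx = 0; tx < HALF_WARP; ++tx)
        addr[tx] = base + 4ULL * tx;
    printf("half-warp 0: %d transaction(s)\n", count_segments(addr));

    // Half-warp 1 (ty = 1): Md[64..79]
    for (int tx = 0; tx < HALF_WARP; ++tx)
        addr[tx] = base + 4ULL * (1 * Width + tx);
    printf("half-warp 1: %d transaction(s)\n", count_segments(addr));

    return 0;
}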
