 # Programming Massively Parallel Processors 6.6 Thread Granularity

I bought NVIDIA’s textbook, and I’m up to chapter 6. It discusses thread granularity a bit, then asks the reader to implement the example code using it, but I cannot figure it out. Has anyone else out there gotten this to work?

Thanks

I don’t have an answer for you but can you tell us what you think of the book? Good? Useful? Thorough?

So far I have learned a lot. It’s an easy read if you’re fairly new to CUDA like I am. I haven’t read the programming guide, but I think a good amount of the book’s material would be covered in there. I did finally figure out the solution; I’ll post it below in case anyone in the future is looking for it.
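
Before the kernel, here is a rough host-side sketch of the indexing it relies on, since that was the part that took me longest to see. With a granularity of 4, each block covers four horizontally adjacent output tiles, so the grid should need only `width/(4*TILE_WIDTH)` blocks in x and `width/TILE_WIDTH` blocks in y. The values of `TILE_WIDTH` and `GRANULARITY` and the helper name `coverage` are my own assumptions for illustration, not something from the book. It is plain C++ that emulates the thread indices on the CPU and counts how often each output element gets assigned.

```cpp
#include <cassert>
#include <vector>

// Hypothetical values for illustration only; the kernel takes TILE_WIDTH
// from a #define and width as a parameter.
constexpr int TILE_WIDTH = 4;
constexpr int GRANULARITY = 4;  // output tiles handled by one thread block

// Emulates the kernel's index arithmetic on the CPU and counts how many
// times each element of the width x width output would be written.
// Assumes width is a multiple of GRANULARITY * TILE_WIDTH.
std::vector<int> coverage(int width) {
    std::vector<int> hits(width * width, 0);
    int gridX = width / (GRANULARITY * TILE_WIDTH);  // 4x fewer blocks in x
    int gridY = width / TILE_WIDTH;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < TILE_WIDTH; ++ty)
                for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                    int Row = by * TILE_WIDTH + ty;
                    int Col = GRANULARITY * bx * TILE_WIDTH + tx;
                    // One thread owns the same (ty, tx) slot in each of the
                    // four tiles, spaced TILE_WIDTH columns apart.
                    for (int j = 0; j < GRANULARITY; ++j)
                        ++hits[Row * width + Col + j * TILE_WIDTH];
                }
    return hits;
}
```

If every entry returned by `coverage(width)` is 1, the mapping touches each output element exactly once. The matching launch would presumably use a `(TILE_WIDTH, TILE_WIDTH)` block and a `(width/(4*TILE_WIDTH), width/TILE_WIDTH)` grid.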

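```
//-----------------Thread Granularity------------------------------------//
// Each block computes four horizontally adjacent TILE_WIDTH x TILE_WIDTH
// output tiles, so every thread accumulates four Pd elements and the Md tile
// is loaded once instead of four times.
// Assumes TILE_WIDTH is #defined elsewhere, blockDim is
// (TILE_WIDTH, TILE_WIDTH), and width is a multiple of 4*TILE_WIDTH.
__global__ void MatrixMulKernel1(float* Md, float* Nd, float* Pd, int width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH*4];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by*TILE_WIDTH + ty;
    int Col = 4*bx*TILE_WIDTH + tx;  // column in the first of the four tiles

    float Pvalue1 = 0;
    float Pvalue2 = 0;
    float Pvalue3 = 0;
    float Pvalue4 = 0;

    for (int k = 0; k < width/TILE_WIDTH; ++k)
    {
        // Load one Md tile and four adjacent Nd tiles into shared memory.
        Mds[ty][tx] = Md[Row*width + (k*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(k*TILE_WIDTH + ty)*width + Col];
        Nds[ty][tx+TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + TILE_WIDTH];
        Nds[ty][tx+2*TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + 2*TILE_WIDTH];
        Nds[ty][tx+3*TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + 3*TILE_WIDTH];

        // Wait until every thread has finished loading before reading the tiles.
        __syncthreads();

        for (int m = 0; m < TILE_WIDTH; ++m)
        {
            Pvalue1 += Mds[ty][m]*Nds[m][tx];
            Pvalue2 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH];
            Pvalue3 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*2];
            Pvalue4 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*3];
        }

        // Wait until every thread is done with the tiles before they are overwritten.
        __syncthreads();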