Programming Massively Parallel Processors 6.6 Thread Granularity

I bought NVIDIA’s textbook, and I’m up to chapter 6. It discusses thread granularity a bit, then asks the reader to implement the example code using it, but I cannot figure it out. Has anyone else out there gotten this to work?


I don’t have an answer for you, but can you tell us what you think of the book? Is it good, useful, thorough?

So far I have learned a lot. It’s an easy read if you’re fairly new to CUDA like I am. I haven’t read the programming guide, but I think a good amount of it is covered in there. I did finally figure out the solution, and I’ll post it below in case anyone in the future is looking for it.

//-----------------Thread Granularity------------------------------------//

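#define TILE_WIDTH 16	// assumed tile size; the post does not show its definition

// Each thread computes four Pd elements, spaced TILE_WIDTH columns apart.
// Assumes width is a multiple of 4*TILE_WIDTH and a launch with
// TILE_WIDTH x TILE_WIDTH threads per block on a
// (width/(4*TILE_WIDTH)) x (width/TILE_WIDTH) grid.
__global__ void MatrixMulKernel1(float* Md, float* Nd, float* Pd, int width)
{
	__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
	__shared__ float Nds[TILE_WIDTH][TILE_WIDTH*4];

	int bx = blockIdx.x; int by = blockIdx.y;
	int tx = threadIdx.x; int ty = threadIdx.y;

	// Row and first column of Pd handled by this thread
	int Row = by*TILE_WIDTH+ty;
	int Col = 4*bx*TILE_WIDTH+tx;

	float Pvalue1 = 0;
	float Pvalue2 = 0;
	float Pvalue3 = 0;
	float Pvalue4 = 0;

	for(int k = 0; k < width/TILE_WIDTH; ++k)
	{
		// Load one Md tile and four adjacent Nd tiles into shared memory
		Mds[ty][tx] = Md[Row*width+(k*TILE_WIDTH+tx)];
		Nds[ty][tx] = Nd[(k*TILE_WIDTH+ty)*width+Col];
		Nds[ty][tx+TILE_WIDTH] = Nd[(k*TILE_WIDTH+ty)*width+Col+TILE_WIDTH];
		Nds[ty][tx+2*TILE_WIDTH] = Nd[(k*TILE_WIDTH+ty)*width+Col+2*TILE_WIDTH];
		Nds[ty][tx+3*TILE_WIDTH] = Nd[(k*TILE_WIDTH+ty)*width+Col+3*TILE_WIDTH];

		// Wait until every thread in the block has finished loading
		__syncthreads();

		for(int m = 0; m < TILE_WIDTH; ++m)
		{
			Pvalue1 += Mds[ty][m]*Nds[m][tx];
			Pvalue2 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH];
			Pvalue3 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*2];
			Pvalue4 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*3];
		}

		// Wait until every thread is done with the tiles before overwriting them
		__syncthreads();
	}

	Pd[Row*width+Col] = Pvalue1;
	Pd[Row*width+Col+TILE_WIDTH] = Pvalue2;
	Pd[Row*width+Col+TILE_WIDTH*2] = Pvalue3;
	Pd[Row*width+Col+TILE_WIDTH*3] = Pvalue4;
}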
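
In case it helps anyone trying this out, here is a rough host-side sketch (plain C++, so it should also compile inside a .cu file) for working out the launch dimensions and checking the result on the CPU. It is only a sketch under some assumptions: width is a multiple of 4*TILE_WIDTH, the matrices are square and row-major, and the names LaunchShape, granularityLaunchShape, matMulReference and maxAbsDiff are made up for this example and are not from the book.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: grid and block sizes for the kernel above.
struct LaunchShape {
    int gridX, gridY;    // number of blocks in x and y
    int blockX, blockY;  // threads per block in x and y
};

// Each block covers TILE_WIDTH rows and 4*TILE_WIDTH columns of Pd,
// so the grid needs a quarter as many blocks in x as in y.
// Assumes width is a multiple of 4 * tileWidth.
LaunchShape granularityLaunchShape(int width, int tileWidth) {
    return LaunchShape{width / (4 * tileWidth), width / tileWidth,
                       tileWidth, tileWidth};
}

// Plain CPU multiply of two row-major width x width matrices.
std::vector<float> matMulReference(const std::vector<float>& M,
                                   const std::vector<float>& N,
                                   int width) {
    std::vector<float> P(static_cast<std::size_t>(width) * width, 0.0f);
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k) {
                sum += M[row * width + k] * N[k * width + col];
            }
            P[row * width + col] = sum;
        }
    }
    return P;
}

// Largest absolute element-wise difference between two result buffers.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

On the host you would then build dim3 dimGrid(s.gridX, s.gridY) and dim3 dimBlock(s.blockX, s.blockY) from the returned shape, launch MatrixMulKernel1<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width), copy Pd back, and compare it against matMulReference with a small tolerance, since the GPU and CPU may round the float sums slightly differently.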