I bought NVIDIA’s textbook, and I’m up to chapter 6. It discusses thread granularity a bit, then asks the reader to implement the example code using it, but I cannot figure it out. Has anyone else out there gotten this to work?
Thanks
I don’t have an answer for you but can you tell us what you think of the book? Good? Useful? Thorough?
So far I have learned a lot. It’s an easy read if you’re fairly new to CUDA like I am. I haven’t read the programming guide, but I think a good amount of it would be covered in there. I did finally figure out the solution; I’ll post it below in case anyone in the future is looking for it.
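//-----------------Thread Granularity------------------------------------//
// Each block computes four horizontally adjacent TILE_WIDTH x TILE_WIDTH tiles of Pd,
// so one Md tile loaded into shared memory is reused for four Nd tiles.
// Assumes TILE_WIDTH is a compile-time constant defined elsewhere, width is a
// multiple of 4*TILE_WIDTH, and the launch uses TILE_WIDTH x TILE_WIDTH threads per
// block with a grid of (width/(4*TILE_WIDTH)) x (width/TILE_WIDTH) blocks.
__global__ void MatrixMulKernel1(float* Md, float* Nd, float* Pd, int width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH*4];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Row of Pd for this thread, and its column in the first of the four tiles
    int Row = by*TILE_WIDTH + ty;
    int Col = 4*bx*TILE_WIDTH + tx;

    // One accumulator per output tile
    float Pvalue1 = 0;
    float Pvalue2 = 0;
    float Pvalue3 = 0;
    float Pvalue4 = 0;

    for (int k = 0; k < width/TILE_WIDTH; ++k)
    {
        // Load one Md tile and four adjacent Nd tiles for this phase
        Mds[ty][tx] = Md[Row*width + (k*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(k*TILE_WIDTH + ty)*width + Col];
        Nds[ty][tx+TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + TILE_WIDTH];
        Nds[ty][tx+2*TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + 2*TILE_WIDTH];
        Nds[ty][tx+3*TILE_WIDTH] = Nd[(k*TILE_WIDTH + ty)*width + Col + 3*TILE_WIDTH];
        __syncthreads();  // all tiles loaded before anyone reads them

        for (int m = 0; m < TILE_WIDTH; ++m)
        {
            Pvalue1 += Mds[ty][m]*Nds[m][tx];
            Pvalue2 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH];
            Pvalue3 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*2];
            Pvalue4 += Mds[ty][m]*Nds[m][tx+TILE_WIDTH*3];
        }
        __syncthreads();  // finish reading before the next phase overwrites the tiles
    }

    // Each thread writes one element in each of the four output tiles
    Pd[Row*width + Col] = Pvalue1;
    Pd[Row*width + Col + TILE_WIDTH] = Pvalue2;
    Pd[Row*width + Col + TILE_WIDTH*2] = Pvalue3;
    Pd[Row*width + Col + TILE_WIDTH*3] = Pvalue4;
}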
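
For anyone trying this later: the kernel above only covers the whole matrix if the grid is shrunk by a factor of four in x. Below is a rough host-side sketch for sanity-checking that. It is plain C++ with no CUDA calls, so it should compile on its own or inside a .cu file. The value of TILE_WIDTH (16 here) is my assumption, and launchShapeFor, everyElementWrittenOnce and referenceMultiply are hypothetical helper names, not anything from the book. It replays the kernel's Row/Col arithmetic on the CPU to confirm each element of Pd gets written exactly once, and provides a naive multiply to compare GPU results against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE_WIDTH  = 16;  // assumed value; must match the kernel's TILE_WIDTH
constexpr int GRANULARITY = 4;   // adjacent Pd tiles computed per block, as in the kernel

// Hypothetical helper: the grid/block sizes the kernel above appears to expect.
struct LaunchShape { int gridX, gridY, blockX, blockY; };

LaunchShape launchShapeFor(int width) {
    // The kernel has no bounds checks, so width must divide evenly.
    assert(width > 0 && width % (GRANULARITY * TILE_WIDTH) == 0);
    return { width / (GRANULARITY * TILE_WIDTH),  // 4x fewer blocks in x
             width / TILE_WIDTH,
             TILE_WIDTH, TILE_WIDTH };
}

// Replays the kernel's index arithmetic for every (block, thread) pair and
// checks that each element of Pd is written exactly once.
bool everyElementWrittenOnce(int width) {
    const LaunchShape s = launchShapeFor(width);
    std::vector<int> writes(static_cast<std::size_t>(width) * width, 0);
    for (int by = 0; by < s.gridY; ++by)
        for (int bx = 0; bx < s.gridX; ++bx)
            for (int ty = 0; ty < s.blockY; ++ty)
                for (int tx = 0; tx < s.blockX; ++tx) {
                    const int Row = by * TILE_WIDTH + ty;
                    const int Col = GRANULARITY * bx * TILE_WIDTH + tx;
                    for (int t = 0; t < GRANULARITY; ++t)
                        ++writes[Row * width + Col + t * TILE_WIDTH];
                }
    for (int w : writes)
        if (w != 1) return false;
    return true;
}

// Naive row-major square matrix multiply, for comparing against the GPU output.
std::vector<float> referenceMultiply(const std::vector<float>& M,
                                     const std::vector<float>& N, int width) {
    std::vector<float> P(static_cast<std::size_t>(width) * width, 0.0f);
    for (int row = 0; row < width; ++row)
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    return P;
}
```

On the GPU side, the fields of LaunchShape would go into dim3 dimGrid(s.gridX, s.gridY) and dim3 dimBlock(s.blockX, s.blockY) for the launch. When comparing the copied-back Pd against referenceMultiply, allow a small tolerance, since the summation order can differ between the two.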