The following code is from matrixMul_kernel.cu in the CUDA SDK projects directory. Some parts of it confuse me.
aBegin = wA * BLOCK_SIZE * by;
wA is given as 3 * BLOCK_SIZE; what does the expression above actually compute?
Why is blockIdx.y needed in this part? Since BLOCK_SIZE is 16 and C is a (8*16) x (5*16) matrix, a thread computes a sub-matrix of C and each block contains 16 threads, so we need 40 blocks. And then I get confused: how do I know how the threads are distributed among the blocks that compute the sub-matrices?
Is blockIdx.x some constant value here? I don't see it defined anywhere before.
How can the loop from aBegin to aEnd compute all the sub-matrices? To me it just loops over a single row of matrix A.
I'm just getting started and don't have a clear feel for CUDA yet; could someone help?
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
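For reference, here is the part of the kernel that this setup feeds into, sketched from the SDK version (minor details may differ). Note that a single block really does walk only one row of tiles of A (and one column of tiles of B); it is the grid of blocks, not this loop, that covers all the sub-matrices of C:

// Csub accumulates the one element of the block's sub-matrix
// that this thread computes
float Csub = 0;
// Loop over all the sub-matrices of A and B required to
// compute the block's sub-matrix
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    // Shared-memory tiles of A and B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    // Each thread loads one element of each tile
    As[ty][tx] = A[a + wA * ty + tx];
    Bs[ty][tx] = B[b + wB * ty + tx];
    // Make sure both tiles are fully loaded
    __syncthreads();
    // Multiply the two tiles; each thread computes one
    // element of the block's sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += As[ty][k] * Bs[k][tx];
    // Make sure the computation is done before the tiles
    // are overwritten in the next iteration
    __syncthreads();
}
// Each thread writes its one element of the result
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;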
blockIdx and threadIdx are built-in variables that CUDA defines inside every __global__ function. They tell you where in the grid the current thread is (there are also blockDim and gridDim). These variables are used to determine what work the current thread should do; the inner loop of a single thread only does that small amount of work.
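As a minimal illustration (a hypothetical kernel, not from the SDK), this is how those built-ins map a thread to one matrix element:

__global__ void addOne(float *M, int width, int height)
{
    // Global column and row of the element this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard threads that fall outside the matrix
    if (row < height && col < width)
        M[row * width + col] += 1.0f;
}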
(BLOCK_SIZE*by + ty) is the thread's row in the grid and in matrices A and C. (BLOCK_SIZE*by) is the first row of the block; it's also the first row of A that gets loaded into shared memory. wA is the width of A, so A[(BLOCK_SIZE*by)*wA] is, as the comment says, the index of the first element of A that will be loaded.
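To tie this back to your numbers: blockIdx.x is not a constant, and neither is blockIdx.y; they are set by the launch configuration on the host. Assuming the usual SDK-style setup (a sketch; hC, wC and the exact kernel signature are illustrative):

// C is (8*16) x (5*16), so hC = 128 rows and wC = 80 columns
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);        // 16 x 16 = 256 threads per block
dim3 grid(wC / BLOCK_SIZE, hC / BLOCK_SIZE); // 5 x 8 = 40 blocks
matrixMul<<<grid, threads>>>(C, A, B, wA, wB);

// Inside the kernel, block (bx, by) computes the 16x16 sub-matrix of C
// whose top-left element is C[(BLOCK_SIZE*by)*wC + BLOCK_SIZE*bx];
// each of its threads computes exactly one element of that tile.

So each block, not each thread, computes one sub-matrix, and each block contains 16*16 = 256 threads rather than 16.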