Questions about the MatrixMul program

The code from matrixMul_kernel.cu of CUDA SDK/projects

Some parts confuse me.

aBegin = wA * BLOCK_SIZE * by;

wA is given as 3 * BLOCK_SIZE. What does the expression above indicate?

Why is blockIdx.y needed in this part? Since BLOCK_SIZE is 16 and C is an (8*16)*(5*16) matrix, a thread computes a sub-matrix of C and each block contains 16 threads, so we need 40 blocks. This is where I get confused: how do I know how the threads are distributed among the blocks that compute the sub-matrices?

Is blockIdx.x some constant value here? I don’t see it defined anywhere before this.

How can the loop from aBegin to aEnd compute all the sub-matrices? To me it just loops over a row of the matrix A.

I am just getting started and don’t have a clear feel for CUDA yet. Could someone help?

    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

blockIdx and threadIdx are built-in variables that CUDA always defines inside __global__ functions (kernels). They tell you where in the grid the current thread is. (There are also blockDim and gridDim.) These variables are used to determine what work the current thread should do. The inner loop of a thread only does that small share of the work.

(BLOCK_SIZE*by+ty) is the thread’s row in the grid and in matrices A and C. (BLOCK_SIZE*by) is the first row of the block; it’s also the first row of A that gets loaded into shared memory. wA is the width of A. Since A is stored row by row, A[ (BLOCK_SIZE*by)*wA ], as the comment says, is the first element of A that will be loaded.

Thank you.
I am trying to get your idea.

I have some questions, too.

In matrixMul_kernel.cu:
aBegin = wA * BLOCK_SIZE * by

aBegin will equal 3840.

I know the A matrix size is 48*80 and the B matrix size is 128*48, so
the C matrix size is 128*80.

But why does aBegin equal wA * BLOCK_SIZE * by?

Thank you.