The code below is from matrixMul_kernel.cu in the CUDA SDK's projects directory.

Some parts of it confuse me.

`aBegin = wA * BLOCK_SIZE * by;`

`wA` is given as 3 * BLOCK_SIZE. What does the expression above compute?

Why is `blockIdx.y` needed in this expression? Since BLOCK_SIZE is 16 and C is an (8*16) x (5*16) matrix, each block of 16 x 16 threads computes one 16 x 16 sub-matrix of C, so we need 40 blocks. And this is where I get confused: how do I know which blocks (and which threads inside them) compute which sub-matrices?

Is `blockIdx.x` some constant value here? I don't see it defined anywhere before this point.

How can the loop from `aBegin` to `aEnd` compute all the sub-matrices? To me it just loops over one block-row of matrix A.

I am just getting started and don't yet have a clear feel for CUDA. Could someone help?

```cuda
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
```