Possible error in the matrixMul sample of CUDA C

In the kernel: template <int BLOCK_SIZE> __global__ void MatrixMulCUDA(float *C, float *A,
float *B, int wA,
int wB) {

I made this substitution:
//int aEnd = aBegin + wA - 1;
int aEnd = aBegin + wA - BLOCK_SIZE;

The commented-out line is the original version; the uncommented line is my version.

I ran both versions and the final test passes in each case, but I think the original aEnd is too big … (Maybe I'm wrong!)

Since the only thing aEnd is used for is the loop termination condition, either formulation will work. A governing assumption for this code is that wA is evenly divisible by BLOCK_SIZE.

Since the loop termination condition involves a <= check, the minimum amount to be subtracted from wA to make this work would be 1, and the maximum amount to be subtracted from wA to make it work in the same way is BLOCK_SIZE. Either one will cause the “striding” loop to terminate in the same way.
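The equivalence can be checked on the host with a small sketch. This is plain C++ and not part of the sample: tileStarts and sameTiles are hypothetical helper names, and the sketch assumes aBegin = wA * BLOCK_SIZE * by and aStep = BLOCK_SIZE, with wA evenly divisible by BLOCK_SIZE.

```cpp
#include <cassert>
#include <vector>

// Offsets visited by a loop of the same shape as the kernel's:
//   for (a = aBegin; a <= aEnd; a += aStep)
std::vector<int> tileStarts(int aBegin, int aEnd, int aStep) {
    std::vector<int> starts;
    for (int a = aBegin; a <= aEnd; a += aStep) {
        starts.push_back(a);
    }
    return starts;
}

// True if both aEnd formulas make the loop visit exactly the same offsets.
// blockRow stands in for the block index "by" (assumed name).
bool sameTiles(int wA, int blockSize, int blockRow) {
    assert(wA % blockSize == 0);  // governing assumption of the sample
    const int aBegin = wA * blockSize * blockRow;
    const int aStep = blockSize;
    const int aEndOriginal = aBegin + wA - 1;          // original bound
    const int aEndModified = aBegin + wA - blockSize;  // proposed bound
    return tileStarts(aBegin, aEndOriginal, aStep) ==
           tileStarts(aBegin, aEndModified, aStep);
}
```

For example, with wA = 320 and BLOCK_SIZE = 32, both bounds give the ten offsets aBegin, aBegin + 32, up to aBegin + 288; the next candidate, aBegin + 320, exceeds either bound.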

To me, if we don't choose
int aEnd = aBegin + wA - BLOCK_SIZE;
the last submatrix of A and B extracted by the block (Bi) belongs to another block … Really strange.

(I know BLOCK_SIZE divides all the matrix dimensions of A, B and C.)

But to me, aEnd is the first element of the last submatrix (of A and B) extracted by the block (Bi), so it should be aBegin + wA - BLOCK_SIZE (I think so from looking at it this way:
https://msdn.microsoft.com/fr-fr/library/hh873134.aspx )

To me, if you look at the tiling method, in matrix A the elements are:

aBegin + wA - 1 = “4”

aBegin + wA - BLOCK_SIZE = “3”

The only thing aEnd is used for in that code is the loop termination condition. Since the loop is striding (by BLOCK_SIZE), the termination conditions are the same.

Ah, I think I understand now: because of the value of aStep, we need

aBegin + wA - BLOCK_SIZE <= aEnd <= aBegin + wA - 1

to make it work!
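That range can be confirmed by brute force. The following is a minimal host-side C++ sketch, not code from the sample: yieldsAllTiles is a hypothetical helper, d is an assumed name for the amount subtracted from aBegin + wA, and aStep is assumed to equal BLOCK_SIZE.

```cpp
#include <cassert>

// True if aEnd = aBegin + wA - d makes the striding loop run exactly
// wA / blockSize times, i.e. once per tile along the row of A.
bool yieldsAllTiles(int wA, int blockSize, int d) {
    assert(wA % blockSize == 0);  // governing assumption of the sample
    const int aBegin = 0;         // block row 0; other rows only shift every offset
    const int aEnd = aBegin + wA - d;
    int iterations = 0;
    for (int a = aBegin; a <= aEnd; a += blockSize) {
        ++iterations;
    }
    return iterations == wA / blockSize;
}
```

With wA = 320 and BLOCK_SIZE = 32, every d from 1 to 32 gives the expected ten iterations. d = 0 adds an eleventh iteration that starts past the end of the row, and d = 33 drops the last tile.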