Bug in CUDA SDK Sample 'Matrix Multiplication (CUBLAS)'?

Hey guys,

is it just me or is there a bug in the CUDA SDK Sample ‘Matrix Multiplication (CUBLAS)’ of CUDA v6.0 (http://docs.nvidia.com/cuda/cuda-samples/index.html#matrix-multiplication--cublas-)?

There we want to compute A*B = C, where the dimensions of A, B, and C are defined as multiples of
A: 4x2, B: 4x2, C: 4x2 (see lines 206-211). Obviously the dimensions don't match at all, and I was wondering why this even works.

From my understanding, you have to change these lines to multiples of
A: 4x2, B: 2x4, C: 4x4 (or some other constants, which match the dimension constraints).

In addition, we have to change the last argument of the function cublasSgemm (the leading dimension of C) to matrix_size.uiWC. With these changes, everything works fine (also with other constants).

What do you think? Is there an error in my reasoning? Sorry if this topic has already been discussed elsewhere; I didn't find any mention of this problem on the web.

It’s not a bug. The sample is confusing.

If you parse out the parameters that are being passed to cublasSgemm on line 285, and study the formula here:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference

you’ll understand what is going on. Note the parameters passed and default values:

m: matrix_size.uiWB = 320
n: matrix_size.uiHA = 640
k: matrix_size.uiWA = 320

also note that the matrix labelled d_A is being passed as the “B” matrix in the formula, and the matrix labelled d_B is being passed as the “A” matrix in the formula. Combining these we have:

Formula: C = α * op(A) * op(B) + β * C, where op(A) is m x k, op(B) is k x n, and C is m x n (op() = no transpose here)

Sgemm: C = α * d_B(m x k) * d_A(k x n) + β * C

So, in effect, only “half” of d_B is being used, since it’s allocated at 320x640 but only 320x320 is being used. Note that something similar is happening in matrixMulCPU, based on the size parameters passed.

As a proof-point, change this line of code:

matrix_size.uiHB = 4 * block_size * iSizeMultiple;

to this:

matrix_size.uiHB = 2 * block_size * iSizeMultiple;

and you will see that the program continues to function normally (with a d_B matrix that is 320x320).

Thanks for the reply. After a long time, I finally found time to deal with this problem again. I agree with your explanation.

I think the only real reason this works is that uiHB is not actually being used, and 4 is more than is necessary to represent the output C. Proof of this is that changing uiHB and uiWA to anything else will cause it to fail: if you change, say, both matrix_size.uiHB and matrix_size.uiWA to 7, it fails, because the stride (leading dimension) for matrix C should actually be matrix_size.uiWB (or, alternatively and probably more clearly, matrix_size.uiWC).

So, basically, change lines 272 and 285 from

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
                            &alpha,
                            d_B, matrix_size.uiWB,
                            d_A, matrix_size.uiWA,
                            &beta,
                            d_C, matrix_size.uiWA));  // ldc = uiWA

to

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
                            &alpha,
                            d_B, matrix_size.uiWB,
                            d_A, matrix_size.uiWA,
                            &beta,
                            d_C, matrix_size.uiWB));  // ldc = uiWB