Question about cublas demo matrixMulCUBLAS

My question is about what this program is doing… (http://docs.nvidia.com/cuda/cuda-samples/#matrix-multiplication--cublas-)

So, when I run it, I see this:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GRID K520" with compute capability 3.0

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)

And it says that it passed. But how can this possibly be? You can't multiply A * B at all with those shapes. Shouldn't matrix A's column count equal matrix B's row count? And yet, no matter how I change the values of matrix_size.uiHA, matrix_size.uiWA, etc., I get wrong results (unless I preserve the same sort of structure as above, with all three matrices having the same dimensions). It seems that matrix_size.uiHB is useless, aside from allocating memory.

From the comments, it seems that what's actually happening is that Transpose[B] * Transpose[A] is being calculated, and that seems to be what matrixMulCPU is doing: it's multiplying the matrices (wB x wA) * (wA x hA).

So, changing to this:

matrix_size.uiWB = 2 * block_size * iSizeMultiple;
matrix_size.uiHB = 2 * block_size * iSizeMultiple;

matrix_size.uiWA = 2 * block_size * iSizeMultiple;
matrix_size.uiHA = 4 * block_size * iSizeMultiple;

matrix_size.uiWC = 2 * block_size * iSizeMultiple;
matrix_size.uiHC = 4 * block_size * iSizeMultiple;

says everything is good. If, though, I change matrix_size.uiHB and matrix_size.uiWA to 3, then I get a bunch of errors. I'm at a loss to explain what's going on in this demo. Can someone explain it to me?

Thanks!

It’s a confusing sample code. This may help:

https://devtalk.nvidia.com/default/topic/764976/cuda-programming-and-performance/bug-in-cuda-sdk-sample-matrix-multiplication-cublas-/

Thanks! But actually, I think it really is a bug, and I think this is it:

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWA));

Because what's being multiplied is Transpose[B] * Transpose[A] = Transpose[C], C's dimensions should be WB x HA, right? But if you look at the ldc parameter (the last argument) to cublasSgemm, it's matrix_size.uiWA.

When I change that code to

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));

Then the code appears to work, no matter what I change the values of HB and WA to (they have to be the same, of course, although in reality it seems as if HB isn't really being used, except for sizing B).

What it is doing is multiplying B*A, and it is using only half of B. That is to say, instead of using all of B(320x640), it is using “half” of B(320x320)

There is no transposing going on, at least not by the cublasSgemm operation.

B(320x320) * A(320x640) = C(320x640)

If by “bug” you mean that it is suggesting that it is multiplying A(320x640) by B(320x640), then I’ll agree with you. That is a “bug” in representation. Otherwise, I’m not sure what you mean by bug.

I’ve already stated that it’s confusing several times now. So you can continue to develop your thesis if you wish, but I’m not sure I’ll have much to say. My suggestion is, if you don’t like that particular CUDA sample then move on. I don’t happen to like it.

There is no transposing going on, at least not by the cublasSgemm operation.

Of course there is. It’s right there in the comments even:

// CUBLAS library uses column-major storage, but C/C++ use row-major storage.
// When passing the matrix pointer to CUBLAS, the memory layout alters from
// row-major to column-major, which is equivalent to an implicit transpose.

// In the case of row-major C/C++ matrix A, B, and a simple matrix multiplication
// C = A * B, we can't use the input order like cublasSgemm(A, B)  because of
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), equal to row(A) != col(B), A(T) and B(T) are not
// multipliable. Moreover, even if A(T) and B(T) are multipliable, the result C
// is a column-based cublas matrix, which means C(T) in C/C++, we need extra
// transpose code to convert it to a row-based C/C++ matrix.

The bug is this: cublasSgemm takes strides (the lda, ldb and ldc parameters), which are the leading dimensions of those matrices. So although you give each matrix a width and a height, cublasSgemm actually uses the stride to compute indices into the arrays. The stride for matrix C should be its column width, which is the width of B, not the width of A.

It's an interesting suggestion to "move on". The point of figuring out that it was a bug is that I'm new to BLAS to begin with, and needed to understand how this code was working in order to "move on" and write my own code. In any case, I think I do understand what's happening now, and have in fact moved on to write my own code.

I mean, there is no explicit transpose requested from the CUBLAS operation. CUBLAS_OP_N means no transpose.

I apologize for saying “if you don’t like that particular CUDA sample then move on”. Perhaps you have taken umbrage at it. You might, however, find other examples of cublasSgemm usage that don’t have “bugs” and might be better candidates to learn from. I know that if someone asked me for a sample cublasSgemm code to learn from, that is not the one I would suggest. In fact, cublasSgemm closely follows canonical blas gemm implementations, so nearly any BLAS gemm example could be a possible candidate.

I have to confess I still don’t know what you mean by bug. Does it produce incorrect results as written? I was under the impression that the results are validated against a corresponding CPU implementation.

The bug is that I should be able to change matrix_size.uiHB and matrix_size.uiWA to any value I like (provided they are equal) and expect it to work. That doesn't happen. Because the call to cublasSgemm passes matrix_size.uiWA instead of matrix_size.uiWB as the last argument, it thinks C's stride is that value. So if you change matrix_size.uiHB and matrix_size.uiWA to 5 (which makes A a 4x5 matrix and B a 5x2 matrix), that should be perfectly legal, yielding a 4x2 matrix for C.

What you actually get is a whole mess of errors. The problem is best illustrated by dropping the "* block_size * iSizeMultiple" factors and looking at the output of a simple 4x5 * 5x2 matrix operation. Here it is. First, change the sizes like so:

matrix_size.uiWA = 5; // * block_size * iSizeMultiple;
matrix_size.uiHA = 4; // * block_size * iSizeMultiple;
matrix_size.uiWB = 2; // * block_size * iSizeMultiple;
matrix_size.uiHB = 5; // * block_size * iSizeMultiple;
matrix_size.uiWC = 2; // * block_size * iSizeMultiple;
matrix_size.uiHC = 4; // * block_size * iSizeMultiple;

Then, I added a function to print out matrices.

static void dumpMatrix(const char *label, const float *data, int width, int height) {
	printf("%s = [", label);
	for (int i = 0; i < height; ++i) {
		if (i) printf(";");
		printf("\n\t");
		for (int j = 0; j < width; ++j) {
			if (j) printf(",");
			printf("%.7f", data[i*width+j]);
		}
	}
	printf("\n];\n");
}

And I added code to call that function at line 335 (after the matrixMulCPU(reference, h_A…) call):

dumpMatrix("A", h_A, matrix_size.uiWA, matrix_size.uiHA);
dumpMatrix("B", h_B, matrix_size.uiWB, matrix_size.uiHB);
dumpMatrix("CUBLAS", h_CUBLAS, matrix_size.uiWC, matrix_size.uiHC);
dumpMatrix("reference", reference, matrix_size.uiWC, matrix_size.uiHC);

What you will get is this:

A = [
	0.3891475,0.1086010,0.0878469,0.1122995,0.1364722;
	0.9458501,0.1784126,0.2887482,0.2918750,0.8214855;
	0.4599280,0.7481667,0.9996260,0.8306763,0.6484379;
	0.0090721,0.7942271,0.1159123,0.6661789,0.6745310
];
B = [
	0.1573469,0.4363011;
	0.9075463,0.4702461;
	0.7924889,0.0902933;
	0.7907283,0.4541965;
	0.2466550,0.5527674
];
CUBLAS = [
	0.3518693,0.3552301;
	0.9729913,1.1093043;
	2.3603363,0.9729913;
	1.1093043,1.0633413
];
reference = [
	0.3518692,0.3552301;
	0.9729913,1.1093043;
	2.3603363,1.3784747;
	1.5072275,1.0633413
];

And then a dump of the differences, followed by:

Comparing CUBLAS Matrix Multiply with CPU results: FAIL

The reference version comes from matrixMulCPU and is correct. The one from CUBLAS is wrong; you can verify this in MATLAB or Octave. It is wrong because the last parameter to cublasSgemm is matrix_size.uiWA instead of matrix_size.uiWB. A properly written program should not work only on particular shapes of matrices; it should work on all valid sizes. I call this a bug. It should not matter that it happens to work on very particular inputs.

For reference, I’ve pushed this all here: http://cablemodem.hex21.com/~binesh/nvidia/matrixMulCUBLAS/