Question about cublas demo matrixMulCUBLAS

My question is about what this program is doing… (http://docs.nvidia.com/cuda/cuda-samples/#matrix-multiplication--cublas-)

So, when I run it, I see this:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GRID K520" with compute capability 3.0

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)

And it says that it passed. But how can this possibly be? You can't multiply A * B at all with those shapes. Shouldn't matrix A's column count equal matrix B's row count? And yet, no matter how I change the values of matrix_size.uiHA, matrix_size.uiWA, etc., I get wrong results (unless I preserve the same sort of structure as above, with all three matrices having the same dimensions). It seems that matrix_size.uiHB is useless, aside from allocating memory.

From the comments, it seems that what's actually happening is that Transpose[B] * Transpose[A] is being calculated, and that seems to be what matrixMulCPU is doing: it's multiplying the matrices (wB x wA) * (wA x hA).

So, changing to this:

matrix_size.uiWB = 2 * block_size * iSizeMultiple;
matrix_size.uiHB = 2 * block_size * iSizeMultiple;

matrix_size.uiWA = 2 * block_size * iSizeMultiple;
matrix_size.uiHA = 4 * block_size * iSizeMultiple;

matrix_size.uiWC = 2 * block_size * iSizeMultiple;
matrix_size.uiHC = 4 * block_size * iSizeMultiple;

says everything is good. If, though, I change matrix_size.uiHB and matrix_size.uiWA to 3, then I get a bunch of errors. I'm at a loss to explain what's going on in this demo. Can someone explain it to me?

Thanks!

It’s a confusing sample code. This may help:

https://devtalk.nvidia.com/default/topic/764976/cuda-programming-and-performance/bug-in-cuda-sdk-sample-matrix-multiplication-cublas-/

Thanks! But actually, I think it really is a bug, and I think this is it:

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWA));

Because what's being multiplied is Transpose[B] * Transpose[A] = Transpose[C], C's dimensions should be WB x HA, right? But if you look at the ldc parameter (the last argument) to cublasSgemm, it's matrix_size.uiWA.

When I change that code to

checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));

Then the code appears to work, no matter what I change the values of HB and WA to (they have to be the same, of course, although in reality it seems as if HB isn't really being used, except for sizing B).

What it is doing is multiplying B*A, and it is using only half of B. That is to say, instead of using all of B(320x640), it is using “half” of B(320x320)

There is no transposing going on, at least not by the cublasSgemm operation.

B(320x320) * A(320x640) = C(320x640)

If by “bug” you mean that it is suggesting that it is multiplying A(320x640) by B(320x640), then I’ll agree with you. That is a “bug” in representation. Otherwise, I’m not sure what you mean by bug.

I’ve already stated that it’s confusing several times now. So you can continue to develop your thesis if you wish, but I’m not sure I’ll have much to say. My suggestion is, if you don’t like that particular CUDA sample then move on. I don’t happen to like it.

There is no transposing going on, at least not by the cublasSgemm operation.

Of course there is. It’s right there in the comments even:

// CUBLAS library uses column-major storage, but C/C++ use row-major storage.
// When passing the matrix pointer to CUBLAS, the memory layout alters from
// row-major to column-major, which is equivalent to an implicit transpose.

// In the case of row-major C/C++ matrix A, B, and a simple matrix multiplication
// C = A * B, we can't use the input order like cublasSgemm(A, B)  because of
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), equal to row(A) != col(B), A(T) and B(T) are not
// multipliable. Moreover, even if A(T) and B(T) are multipliable, the result C
// is a column-based cublas matrix, which means C(T) in C/C++, we need extra
// transpose code to convert it to a row-based C/C++ matrix.

The bug is this: cublasSgemm takes strides (the lda, ldb and ldc parameters), which are the leading dimensions of those matrices. So although you give each matrix a width and a height, cublasSgemm actually uses the stride to compute indices into the arrays. The stride for matrix C should be its column width, which is the width of B, not the width of A.

It's an interesting suggestion to "move on". The point of figuring out that it was a bug is that I'm new to BLAS to begin with, and needed to understand how this code was working in order to "move on" and write my own code. In any case, I think I do understand what's happening now, and have in fact moved on to write my own code.

I mean, there is no explicit transpose requested from the CUBLAS operation. CUBLAS_OP_N means no transpose.

I apologize for saying “if you don’t like that particular CUDA sample then move on”. Perhaps you have taken umbrage at it. You might, however, find other examples of cublasSgemm usage that don’t have “bugs” and might be better candidates to learn from. I know that if someone asked me for a sample cublasSgemm code to learn from, that is not the one I would suggest. In fact, cublasSgemm closely follows canonical blas gemm implementations, so nearly any BLAS gemm example could be a possible candidate.

I have to confess I still don’t know what you mean by bug. Does it produce incorrect results as written? I was under the impression that the results are validated against a corresponding CPU implementation.

The bug is that I should be able to change matrix_size.uiHB and matrix_size.uiWA to any value I like (provided they are equal) and expect it to work. That doesn't happen. Because the call to cublasSgemm passes matrix_size.uiWA instead of matrix_size.uiWB as the last argument, it thinks C's stride is that value. So if you change matrix_size.uiHB and matrix_size.uiWA to 5 (which makes A a 4x5 matrix and B a 5x2 matrix), that should be perfectly legal, yielding a 4x2 matrix for C.

What you actually get is a whole mess of errors. The problem is best illustrated by dropping the "* block_size * iSizeMultiple" factors and looking at the output of a simple 4x5 * 5x2 matrix operation. Here it is. First, change the sizes like so:

matrix_size.uiWA = 5; // * block_size * iSizeMultiple;
matrix_size.uiHA = 4; // * block_size * iSizeMultiple;
matrix_size.uiWB = 2; // * block_size * iSizeMultiple;
matrix_size.uiHB = 5; // * block_size * iSizeMultiple;
matrix_size.uiWC = 2; // * block_size * iSizeMultiple;
matrix_size.uiHC = 4; // * block_size * iSizeMultiple;

Then, I added a function to print out matrices.

static void dumpMatrix(const char *label, const float *data, int width, int height) {
	printf("%s = [", label);
	for (int i = 0; i < height; ++i) {
		if (i) printf(";");
		printf("\n\t");
		for (int j = 0; j < width; ++j) {
			if (j) printf(",");
			printf("%.7f", data[i*width+j]);
		}
	}
	printf("\n];\n");
}

And I added code to call that function at line 335 (after the matrixMulCPU(reference, h_A…) call):

dumpMatrix("A", h_A, matrix_size.uiWA, matrix_size.uiHA);
dumpMatrix("B", h_B, matrix_size.uiWB, matrix_size.uiHB);
dumpMatrix("CUBLAS", h_CUBLAS, matrix_size.uiWC, matrix_size.uiHC);
dumpMatrix("reference", reference, matrix_size.uiWC, matrix_size.uiHC);

What you will get is this:

A = [
	0.3891475,0.1086010,0.0878469,0.1122995,0.1364722;
	0.9458501,0.1784126,0.2887482,0.2918750,0.8214855;
	0.4599280,0.7481667,0.9996260,0.8306763,0.6484379;
	0.0090721,0.7942271,0.1159123,0.6661789,0.6745310
];
B = [
	0.1573469,0.4363011;
	0.9075463,0.4702461;
	0.7924889,0.0902933;
	0.7907283,0.4541965;
	0.2466550,0.5527674
];
CUBLAS = [
	0.3518693,0.3552301;
	0.9729913,1.1093043;
	2.3603363,0.9729913;
	1.1093043,1.0633413
];
reference = [
	0.3518692,0.3552301;
	0.9729913,1.1093043;
	2.3603363,1.3784747;
	1.5072275,1.0633413
];

And then a dump of the differences, followed by:

Comparing CUBLAS Matrix Multiply with CPU results: FAIL

The reference version comes from matrixMulCPU and is correct. The one from CUBLAS is wrong; you can verify this in MATLAB or Octave. It is wrong because the last parameter to cublasSgemm is matrix_size.uiWA instead of matrix_size.uiWB. A properly written program should not work only on particular shapes of matrices; it should work on all valid sizes. I call this a bug. It should not matter that it happens to work on very particular inputs.

For reference, I’ve pushed this all here: http://cablemodem.hex21.com/~binesh/nvidia/matrixMulCUBLAS/