CUBLAS matrix multiplication

Hello. Sorry for my English, it’s not my native language. Please help me fix my code. I am trying to multiply three matrices: A (m x n), B (n x k), and D (k x l).

This code works sometimes, but sometimes it does not.
For example, with
"
const int m=100;
const int n=101;
const int k=102;
const int l=103;
"
I get
"
(:
Device: nan; Host: 28234250977280.000000
):
"

The first multiplication is correct (smile), but the second is not.

With
"
const int m=110;
const int n=39;
const int k=112;
const int l=132;
"

I get

"
** On entry to SGEMM parameter number 10 had an illegal value
Multiplication failed. (1)
"

Multiplication does not start.

With
"
const int m=11;
const int n=11;
const int k=11;
const int l=11;
"
I get the right answer, but I don’t want to multiply only square matrices (and it does not work for all sizes).

Here is the code: http://pastebin.com/Jz3gubV1. I can’t paste it directly in this message (the topic won’t post).

Sorry, I am not using the “code” tag now, because it works badly (maybe only in my Firefox 16?).

Keep in mind that the storage convention used by CUBLAS for two-dimensional matrices is column-major ordering (the elements of a column occupy consecutive storage locations). This is the ordering used by Fortran and Matlab, for example. C and C++ use row-major ordering. Consequently, the “leading dimension” arguments (LDA, LDB, LDC) passed to *GEMM would be equal to the number of rows in each matrix for your examples.

We are aware of issues with the “code” tag in the new forums, sorry for the problems with that. I have raised the issue internally before and will do so again. What issues did you see specifically? The one I have encountered is that line-continuation backslashes in multiline macros get eliminated. Also, one cannot simply cut & paste from a “code” section because line numbers have been added.

Sorry, I did not understand exactly. I read that there is column-major ordering (here: http://docs.nvidia.com/cuda/cublas/index.html#topic_7_2), and it says that lda is the number of rows in the matrix. I thought that since the matrices are treated as column-major, and I have matrices A (m x n) and B (n x k), when I pass them to cublasSgemm I must think of it as multiplying B (k x n) by A (n x m) (I must reverse the order). So we have:

cublasSgemm(handle,
			CUBLAS_OP_N, CUBLAS_OP_N,
			k, m, n,
			scal,      // alpha (1)
			dev_B, k,  // ?
			dev_A, n,  // ?
			(scal+1),  // beta (0)
			dev_C, k); // ?

and so we get C (k x m) in column-major terms; read back on the host in row-major order, it is C (m x k).

When I tried to use 4 code tags, I got one message per tag, and I could not post the program code (the topic was not created). Maybe there is a character limit?

The storage layout conventions do not change the mathematical dimensions of the matrix. In this case (A and B are not transposed):

GEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)

A(m x k) --> LDA is m
B(k x n) --> LDB is k
C(m x n) --> LDC is m

Sorry, I still can’t understand it. For example,
m=2, n=3, k=2;
A={1,2,3,4,5,6};
B={7,8,9,1,2,3};
For row-oriented we have
1 2 3
4 5 6

and

7 8
9 1
2 3

But for column-oriented we have
1 3 5
2 4 6

and

7 1
8 2
9 3

There is code like this in the SDK sample (matrixMul.cu):

//some NVIDIA's code

//note cublas is column primary!
            //need to transpose the order
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, uiWB, uiHA, uiWA, &alpha, d_B, uiWB, d_A, uiWA, &beta, d_C, uiWA);

//some NVIDIA's code

 //Performs warmup operation using matrixMul CUDA kernel
		if (block_size == 16) {
            matrixMul>(d_C, d_A, d_B, uiWA, uiWB);

//some NVIDIA's code

So, we know that (A x B)t (transposed) = (B)t x (A)t, and we know that reading a row-major matrix in column-major order is just a matrix transposition. So my code looks correct. Please correct me if I’m wrong.

It looks like the parser cut some code. I meant:
matrixMul<16><<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);

I’m so sorry, I should have read the instructions carefully. Here is the code:

//memory allocation and so on
...
//first multiplication: C = A*B, computed as B*A in column-major terms
 stat = cublasSgemm(handle,
			CUBLAS_OP_N, CUBLAS_OP_N,
			k, m, n,   // M = k, N = m, K = n
			scal,      // alpha (1)
			dev_B, k,  // ld of B = k
			dev_A, n,  // ld of A = n
			(scal+1),  // beta (0)
			dev_C, k); // ld of C = k
...
 //second multiplication: E = C*D, computed as D*C in column-major terms
     stat = cublasSgemm(handle,
			CUBLAS_OP_N, CUBLAS_OP_N,
			l, m, k,   // M = l, N = m, K = k
			scal,      // alpha (1)
			dev_D, l,  // ld of D = l
			dev_C, k,  // ld of C = k
			(scal+1),  // beta (0)
			dev_E, l); // ld of E = l

...

I tried it with
"
const int m=110;
const int n=39;
const int k=112;
const int l=132;
"
and some other parameters, and it works. The computations with large matrices show small discrepancies (CPU and GPU results differ slightly, presumably from floating-point rounding), but it works! Thank you!