How to use CUBLAS in C ?

Hello,

I’m trying to multiply two matrices with the cublasSgemm function:

A

0.1    0.01

0.001  0

0      0

At

0.1  0.001 0

0.01 0     0

B

1 2

3 4

5 6

I need to compute the product At*B, which gives:

AtB

0.103 0.204

0.01  0.02

It works fine with GotoBLAS2, but now I would like to speed up the computation by using CUBLAS. I tried it, but since CUBLAS assumes Fortran-style (column-major) storage, here are the matrices it actually sees (f stands for Fortran, t for transpose):

Af

0.1   0

0.01  0

0.001 0

Aft

0.1 0.01 0.001

0   0    0

Bf

1 4

2 5

3 6

AftBf

0.123 0.456

0     0

(AftBf)c

0.123 0

0.456 0

(It took me a long time to work out how it arrived at that final matrix.)
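The reinterpretation described above can be reproduced on the CPU. The sketch below is plain C, and the helper names rowmajor_at, colmajor_at and print_views are made up for illustration. It only shows how the same six floats of A are indexed under each convention: the C view gives A, the Fortran view gives Af.

```c
#include <assert.h>
#include <stdio.h>

/* Element (i, j) of a row-major buffer whose rows are ld floats apart. */
float rowmajor_at(const float *buf, int ld, int i, int j)
{
    assert(j < ld);
    return buf[i * ld + j];
}

/* Element (i, j) of the same buffer read the Fortran way:
   columns are ld floats apart. */
float colmajor_at(const float *buf, int ld, int i, int j)
{
    assert(i < ld);
    return buf[i + j * ld];
}

/* Print the 3x2 matrix stored in A under both conventions, row by row. */
void print_views(const float *A)
{
    for (int i = 0; i < 3; ++i)
        printf("C: %g %g    Fortran: %g %g\n",
               rowmajor_at(A, 2, i, 0), rowmajor_at(A, 2, i, 1),
               colmajor_at(A, 3, i, 0), colmajor_at(A, 3, i, 1));
}
```

Calling print_views on the buffer {0.1, 0.01, 0.001, 0, 0, 0} lists A in the left columns and Af in the right columns.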

I work in the optimization field and I don’t want to rewrite my matrices in column-major order. How can I use CUBLAS from CUDA so that it multiplies row-major matrices?

Thank you.

EDIT :

I realized that if you swap the dimensions of the matrices (m x n becomes n x m) and then swap the operands of the product (At*B becomes Bf*Aft), it theoretically works in Fortran (f stands for Fortran, t for transpose):

A (3x2)

0.1    0.01

0.001  0

0      0

At (2x3)

0.1  0.001 0

0.01 0     0

B (3x2)

1 2

3 4

5 6

At*B

0.103  0.204

0.01   0.02

Bf (2x3)

1 3 5

2 4 6

Af (2x3)

0.1   0.001  0

0.01  0      0

Aft

0.1    0.01

0.001  0

0      0

Bf * Aft

0.103 0.01

0.204 0.02

When that final matrix is read in C, it gives :

0.103 0.204

0.01  0.02

At*B !!!!
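The worked example above can be checked with a small CPU sketch. ref_sgemm is a hypothetical plain C stand-in that follows the column-major SGEMM convention and argument order; it is not a CUBLAS function, and atb_rowmajor is likewise an illustrative name. The buffers of A and B are passed exactly as they are laid out in C, with the operands swapped and 't' applied to the buffer of A.

```c
#include <assert.h>

/* Stand-in for a column-major SGEMM (assumed helper, not part of CUBLAS):
   C = alpha * op(A) * op(B) + beta * C, with C m x n and op(A) m x k. */
static void ref_sgemm(char ta, char tb, int m, int n, int k, float alpha,
                      const float *A, int lda, const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    int transa = (ta == 't' || ta == 'T');
    int transb = (tb == 't' || tb == 'T');
    assert(lda >= (transa ? k : m));
    assert(ldb >= (transb ? n : k));
    assert(ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l) {
                float a = transa ? A[l + i * lda] : A[i + l * lda];
                float b = transb ? B[j + l * ldb] : B[l + j * ldb];
                acc += a * b;
            }
            /* Like BLAS, do not read C when beta is zero. */
            C[i + j * ldc] = (beta == 0.0f) ? alpha * acc
                                            : alpha * acc + beta * C[i + j * ldc];
        }
    }
}

/* Row-major C (2x2) = At * B for row-major A (3x2) and B (3x2).
   Seen column-major, the buffer of B is Bf (2x3, ld 2) and the buffer of A
   is Af (2x3, ld 2), so the call computes Bf * Aft. */
void atb_rowmajor(const float *A, const float *B, float *C)
{
    ref_sgemm('n', 't', 2, 2, 3, 1.0f, B, 2, A, 2, 0.0f, C, 2);
}
```

Read back as a row-major 2x2 array, C holds 0.103 0.204 / 0.01 0.02, up to float rounding.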

I tried this with three matrices:

V (m*r - 3*2 (C)): 

0.100000 0.010000 0.001000 0.200000 0.020000 0.002000 

M (m*n - 3*2 (C)): 

1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 

W (n*r - 2*2 (C)): 

17.739658 28.783649 14.911300 19.856079

I have to compute VtM, WMt, VtV and WWt. I use these four function calls:

cublasSgemm('n', 't', n, r, m, 1.0, ****, n, cuV, m, 0.0, cuVtM, n);

cublasSgemm('n', 't', r, r, m, 1, cuV, r, cuV, m, 0, cuVtV, r);

cublasSgemm('n', 't', m, r, n, 1.0, cuMt, m, cuWt, n, 0.0, cuWMt, m);

cublasSgemm('n', 't', r, r, n, 1, cuWt, r, cuWt, n, 0, cuWWt, r);

The results are incorrect for the first two products but correct for the last two:

CuVtM :

1.408013 1.849615 3.104844 3.741813

VtM :

0.203000 0.324000 0.620000 0.832000

CuVtV :

0.013032 0.041283 0.013159 0.005314

VtV :

0.010401 0.001240 0.001240 0.040104

CuWMt :

75.306961 168.353577 261.400177 54.623459 124.158218 193.692963

WMt :

75.306961 168.353577 261.400208 54.623459 124.158218 193.692963

CuWWt :

1143.193970 836.051758 836.051758 616.610718 0.000000 0.000000

WWt :

1143.193970 836.051758 836.051758 616.610718 0.000000 0.000000

I realized that the last two results are correct only because W is a square matrix (otherwise it no longer works).

Does anyone have an idea where I made a mistake and what I should do to get the correct results?
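One plausible cause for the two failing products is the leading dimension passed for cuV. Seen column-major, the row-major m x r buffer of V is an r x m matrix, so its leading dimension would be r, not m, and with m = 3 and r = 2 a value of m makes the routine read past the end of the buffer. The sketch below checks this on the CPU only. ref_sgemm is a hypothetical plain C stand-in for a column-major SGEMM, not a CUBLAS call, and the sketch assumes the masked first operand of the VtM call is the row-major buffer of M.

```c
#include <assert.h>

/* Stand-in for a column-major SGEMM (assumed helper, not part of CUBLAS):
   C = alpha * op(A) * op(B) + beta * C, with C m x n and op(A) m x k. */
static void ref_sgemm(char ta, char tb, int m, int n, int k, float alpha,
                      const float *A, int lda, const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    int transa = (ta == 't' || ta == 'T');
    int transb = (tb == 't' || tb == 'T');
    assert(lda >= (transa ? k : m));
    assert(ldb >= (transb ? n : k));
    assert(ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l) {
                float a = transa ? A[l + i * lda] : A[i + l * lda];
                float b = transb ? B[j + l * ldb] : B[l + j * ldb];
                acc += a * b;
            }
            C[i + j * ldc] = (beta == 0.0f) ? alpha * acc
                                            : alpha * acc + beta * C[i + j * ldc];
        }
    }
}

/* Sizes from the post: V is m x r and M is m x n, both row-major. */
enum { DIM_M = 3, DIM_N = 2, DIM_R = 2 };

/* VtM (r x n, row-major): same flags and sizes as in the post,
   but the leading dimension of V is r instead of m. */
void vtm(const float *V, const float *M, float *VtM)
{
    ref_sgemm('n', 't', DIM_N, DIM_R, DIM_M, 1.0f,
              M, DIM_N, V, DIM_R, 0.0f, VtM, DIM_N);
}

/* VtV (r x r, row-major), again with leading dimension r for both operands. */
void vtv(const float *V, float *VtV)
{
    ref_sgemm('n', 't', DIM_R, DIM_R, DIM_M, 1.0f,
              V, DIM_R, V, DIM_R, 0.0f, VtV, DIM_R);
}
```

With the V and M values listed above, this stand-in reproduces the expected VtM and VtV. Whether the same change fixes the CUBLAS calls would still need to be confirmed on the GPU.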

Thank you

I finally found out how to use CUBLAS in a C program. Here is the code that lets you use the BLAS row-major convention:

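/* Row-major SGEMM on top of column-major cublasSgemm.
   A row-major buffer read column-major is the transposed matrix, so the
   operands are swapped, m and n are swapped, and the result buffer then
   holds the row-major product. A, B and C are device pointers. */
void rcudaSgemm(char transa, char transb, unsigned int m, unsigned int n, unsigned int k, float alpha, float * A, unsigned int lda, float * B, unsigned int ldb, float beta, float * C, unsigned int ldc){
	if(transa == 'n' || transa == 'N'){
		if(transb == 'n' || transb == 'N'){
			/* C = A * B */
			cublasSgemm('n', 'n', n, m, k, alpha, B, ldb, A, lda, beta, C, ldc);
		}
		else{
			/* C = A * Bt */
			cublasSgemm('t', 'n', n, m, k, alpha, B, ldb, A, lda, beta, C, ldc);
		}
	}
	else{
		if(transb == 'n' || transb == 'N'){
			/* C = At * B */
			cublasSgemm('n', 't', n, m, k, alpha, B, ldb, A, lda, beta, C, ldc);
		}
		else{
			/* C = At * Bt: not handled yet */
			//TODO
			//cublasSgemm('t', 't', n, m, k, alpha, B, ldb, A, lda, beta, C, ldc);
		}
	}
}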

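/* Row-major SGEMV expressed through cublasSgemm: X and Y are treated as
   one-row matrices whose leading dimensions are incx and incy. */
void rcudaSgemv(char transa, unsigned int m, unsigned int n, float alpha, float * A, unsigned int lda, float * X, unsigned int incx, float beta, float * Y, unsigned int incy){
	if(transa == 'n' || transa == 'N'){
		/* Y = A * X, with A m x n */
		cublasSgemm('n', 'n', 1, m, n, alpha, X, incx, A, lda, beta, Y, incy);
	}
	else{
		/* Y = At * X, with A m x n */
		cublasSgemm('n', 't', 1, n, m, alpha, X, incx, A, lda, beta, Y, incy);
	}
}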

I haven’t looked at the fourth case of the sgemm function yet, but it should be similar to the other cases. I also haven’t tested the case where the leading dimensions differ from m, n or k.
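The three working branches all follow from the identity (op(A) * op(B))t = op(B)t * op(A)t, which suggests that the transpose flags can simply be forwarded in swapped order, fourth case included. The sketch below tries this on the CPU. ref_sgemm is a hypothetical plain C stand-in for a column-major SGEMM and row_sgemm is an illustrative name; neither is part of CUBLAS, and the result has not been checked against cublasSgemm itself.

```c
#include <assert.h>

/* Stand-in for a column-major SGEMM (assumed helper, not part of CUBLAS):
   C = alpha * op(A) * op(B) + beta * C, with C m x n and op(A) m x k. */
static void ref_sgemm(char ta, char tb, int m, int n, int k, float alpha,
                      const float *A, int lda, const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    int transa = (ta == 't' || ta == 'T');
    int transb = (tb == 't' || tb == 'T');
    assert(lda >= (transa ? k : m));
    assert(ldb >= (transb ? n : k));
    assert(ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l) {
                float a = transa ? A[l + i * lda] : A[i + l * lda];
                float b = transb ? B[j + l * ldb] : B[l + j * ldb];
                acc += a * b;
            }
            C[i + j * ldc] = (beta == 0.0f) ? alpha * acc
                                            : alpha * acc + beta * C[i + j * ldc];
        }
    }
}

/* Row-major SGEMM: C (m x n) = alpha * op(A) * op(B) + beta * C.
   lda, ldb and ldc are row strides here. Operands, flags and the
   m and n sizes all swap places before reaching the column-major routine. */
void row_sgemm(char transa, char transb, int m, int n, int k, float alpha,
               const float *A, int lda, const float *B, int ldb,
               float beta, float *C, int ldc)
{
    ref_sgemm(transb, transa, n, m, k, alpha, B, ldb, A, lda, beta, C, ldc);
}
```

In a small test with padded rows (leading dimensions larger than the row lengths), the 't', 't' case and the 'n', 'n' case both match a hand-computed row-major product, which is at least consistent with the commented-out call in the TODO branch.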