 # A newbie question on cublasSgemm

I’m using CUDA 1.1 and trying the tutorial code for simple matrix multiplication, to test its performance on my 8600M.
When I compare the results, the simple algorithm gives matrixC = matrixA * matrixB, but cublasSgemm gives matrixC = matrixB * matrixA! How is that possible? (If I swap A and B in the simple algorithm, I get the same result as CUBLAS.)

The code is the same as in the official programming guide.
The cublasSgemm call is:
cublasSgemm('n', 'n', m, n, k, 1.0f, dA, lda, dB, ldb, 0.0f, dC, ldc);

Thank you.

CUBLAS uses the Fortran memory layout (column-major), while your C code is probably using the C memory layout (row-major). So in one case you are computing the product with matrices that are the transposes of the ones in the other case.
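To make the layout point concrete, here is a minimal sketch in plain C (no CUBLAS needed). The helper names `at_row_major` and `at_col_major` are made up for illustration. It shows that one and the same buffer, read with column-major indexing, yields the transpose of what row-major indexing yields.

```c
#include <stddef.h>

/* Element (row, col) of a matrix stored row-major (C convention).
   `cols` is the number of columns. Hypothetical helper. */
float at_row_major(const float *a, size_t cols, size_t row, size_t col) {
    return a[row * cols + col];
}

/* Element (row, col) of a matrix stored column-major (Fortran/CUBLAS
   convention). `ld` is the leading dimension, i.e. the number of rows
   when the matrix is stored without padding. Hypothetical helper. */
float at_col_major(const float *a, size_t ld, size_t row, size_t col) {
    return a[col * ld + row];
}
```

For a buffer holding a 2 x 3 matrix in row-major order, `at_col_major(buf, 3, j, i)` returns the same value as `at_row_major(buf, 3, i, j)`: a column-major reader sees a 3 x 2 matrix that is the transpose of the original.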

I’m using the same C code as in the tutorial:

[…]

for (int k = 0; k < 16; ++k) Csub += As[ty][k] * Bs[k][tx];

[…]

k is the column index in As, and the row index in Bs. Or not?

So, if I use the CUBLAS library, should I pass the matrix to cublasSgemm column by column, instead of row by row?

So, to pass

|x00 x10 x20|
|x01 x11 x21|
|x02 x12 x22|

should I pass an array laid out as
[x00, x01, x02, x10, x11, x12, x20, x21, x22]
or as
[x00, x10, x20, x01, x11, x21, x02, x12, x22]
?

I thought the Fortran memory layout was only a different way of writing the compact form of an array of arrays… e.g. in C the array is 2 columns * 5 rows, where neighbouring values are in the same row, and in Fortran the array is 5 columns * 2 rows, where neighbouring values are in the same row too… sorry!!
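For the 3 x 3 example above, reading the matrix as drawn (the first digit of each x is the column, the second the row), the first listing walks the matrix column by column, which is the column-major order, and the second walks it row by row. A small sketch of the conversion, assuming a dense matrix with no padding; `row_to_col_major` is a hypothetical helper, not a CUBLAS function:

```c
/* Copy a rows x cols matrix from row-major order (src) into
   column-major order (dst). Both buffers hold rows * cols floats
   and must not overlap. Hypothetical helper for illustration. */
void row_to_col_major(const float *src, float *dst, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            /* Row-major offset: r * cols + c.
               Column-major offset: c * rows + r. */
            dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

Encoding xCR as the number 10 * C + R, the row-major input {0, 10, 20, 1, 11, 21, 2, 12, 22} comes out as {0, 1, 2, 10, 11, 12, 20, 21, 22}, matching the first listing.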

This will do the job:

cublasSgemm('t', 't', m, n, k, 1.0f, dA, lda, dB, ldb, 0.0f, dC, ldc);

right… thank you.

:) I’ll just swap A and B in the call, because C also comes out transposed.
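Why the swap works: a row-major buffer read as column-major is the transpose, and (A * B)^T = B^T * A^T, so asking a column-major GEMM for "B times A" with the dimensions exchanged leaves A * B in row-major order in C. The sketch below checks this on the CPU. `ref_sgemm_nn` is a plain-C stand-in for a column-major 'n','n' SGEMM with alpha = 1 and beta = 0 (it is not the CUBLAS routine), and `row_major_matmul` is a hypothetical wrapper; dense storage with no padding is assumed.

```c
/* Stand-in for a column-major SGEMM without transposes:
   C (m x n) = A (m x k) * B (k x n), all stored column-major,
   with leading dimensions lda, ldb, ldc. Illustration only. */
void ref_sgemm_nn(int m, int n, int k,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float *C, int ldc) {
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[p * lda + row] * B[col * ldb + p];
            }
            C[col * ldc + row] = sum;
        }
    }
}

/* Row-major product C (m x n) = A (m x k) * B (k x n), computed by
   handing B and A to the column-major routine in swapped order.
   With the legacy CUBLAS API, the analogous call would presumably be
   cublasSgemm('n', 'n', n, m, k, 1.0f, dB, n, dA, k, 0.0f, dC, n). */
void row_major_matmul(int m, int n, int k,
                      const float *A, const float *B, float *C) {
    ref_sgemm_nn(n, m, k, B, n, A, k, C, n);
}
```

Note that the leading dimensions change along with the operand order: each one is the row length of the corresponding row-major matrix.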