# Row-major by row-major matrix multiplication in cuBLAS

I have a matrix A^T of size N*M (rows, cols) and a matrix B^T of size N*M.

Both matrices are stored in memory in transposed form, that is, the buffers hold A^T and B^T rather than A and B.

Both buffers are in row-major order.

I’d like to compute

``````
C = A^T * B
``````

which should have dimensions N*N.

I already know that, to switch between row-major and column-major order, I can transpose the matrix in the multiplication call.

So I have been thinking about doing the following multiplication:

``````
C = (A^T)^T * B^T = A * B^T
``````
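To make the layout relationship concrete, here is a minimal C sketch of the assumption I am relying on: a row-major buffer with `cols` columns, read with column-major indexing and leading dimension `cols`, appears as the transposed matrix. The helper names (`at_row_major`, `at_col_major`, `same_buffer_is_transpose`) are made up for this illustration and are not part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Element (r, c) of a row-major matrix that has `cols` columns. */
double at_row_major(const double *buf, size_t cols, size_t r, size_t c)
{
    return buf[r * cols + c];
}

/* Element (r, c) of a column-major matrix with leading dimension `ld`
   (ld equals the number of rows when there is no padding). */
double at_col_major(const double *buf, size_t ld, size_t r, size_t c)
{
    assert(r < ld);
    return buf[c * ld + r];
}

/* Returns 1 if the row-major rows x cols buffer, read as a column-major
   cols x rows matrix, is element for element the transpose; 0 otherwise. */
int same_buffer_is_transpose(const double *buf, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            if (at_row_major(buf, cols, r, c) != at_col_major(buf, cols, c, r))
                return 0;
    return 1;
}
```

For a buffer `{1, 3, 5, 2, 4, 6}` holding a 2 x 3 row-major matrix, `at_col_major(buf, 3, 0, 1)` reads the value 2, which is element (0, 1) of the 3 x 2 transpose.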

However, it still doesn’t seem to work, so I guess I’m missing something.
I have been trying the following call:

``````
int numCols = N;
int numRows = M;
cublasStatus = cublasDgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N,
                           numCols, numCols, numRows,
                           &scale, A, numRows, B, numRows,
                           &scale, C, numCols);
``````

Example:

Say I have two matrices, written in Matlab notation:

``````
A = [ 1, 2 ; 3, 4 ; 5, 6 ];
B = [ 7, 8 ; 9, 10 ; 11, 12 ];
``````

However, in memory I have their transposes:

``````
%   A^T = [ 1, 3, 5 ; 2, 4, 6 ];
%   B^T = [ 7, 9, 11 ; 8, 10, 12 ];
At = A';
Bt = B';
``````

I’d like to compute At*B, which has dimensions 2*2.

According to Matlab, At*B = `[ 89, 98 ; 116, 128 ]`. My CUDA test program is:

``````
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cublas_v2.h>
#include <stdio.h>

int main()
{
    // Row-major buffers holding A^T and B^T (2 x 3 each).
    const double At[] = { 1, 3, 5, 2, 4, 6 };
    const double Bt[] = { 7, 9, 11, 8, 10, 12 };

    // Dimensions of the original matrices A and B.
    unsigned int cols = 2;
    unsigned int rows = 3;

    double *p_d_At = 0, *p_d_Bt = 0, *p_d_C = 0;

    cudaMalloc((void**)&p_d_At, sizeof(double) * cols * rows);
    cudaMalloc((void**)&p_d_Bt, sizeof(double) * cols * rows);
    cudaMalloc((void**)&p_d_C, sizeof(double) * cols * cols);

    cudaMemcpy(p_d_At, At, sizeof(double) * cols * rows, cudaMemcpyHostToDevice);
    cudaError_t cudaError = cudaMemcpy(p_d_Bt, Bt, sizeof(double) * cols * rows, cudaMemcpyHostToDevice);

    cublasStatus_t cublasStatus;
    cublasHandle_t cublasHandle;
    cublasCreate(&cublasHandle);
    const double scale = 1.0;
    // This is the call that gives me the wrong result.
    cublasStatus = cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, cols, cols, rows,
                               &scale, p_d_At, cols, p_d_Bt, cols, &scale, p_d_C, cols);

    double C[4]; // cols * cols values
    cudaMemcpy(C, p_d_C, sizeof(double) * cols * cols, cudaMemcpyDeviceToHost);

    return 0;
}
``````

Unfortunately, the result is not correct :(

I have several questions:

1st, how do I use cuBLAS to compute the multiplication above?

2nd, how do I use cuBLAS to get the matrix C in row-major order?

3rd, how do I use cuBLAS to get the matrix C in column-major order?
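To pin down what the 2nd and 3rd questions are asking for, here is a CPU-only sketch in C, assuming cuBLAS follows the usual column-major BLAS GEMM convention (C = alpha * op(A) * op(B) + beta * C). `ref_op_t`, `ref_dgemm`, `atb_col_major` and `atb_row_major` are hypothetical helpers written for this illustration, not cuBLAS functions. The idea is that a row-major N x M buffer, read column-major with leading dimension M, is the M x N original matrix, and that swapping the two operands produces the transposed result, which is the row-major layout of C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CUBLAS_OP_N / CUBLAS_OP_T. */
typedef enum { REF_OP_N, REF_OP_T } ref_op_t;

/* Element (r, c) of op(X), where X is stored column-major with leading dimension ld. */
double ref_elem(const double *X, int ld, ref_op_t op, int r, int c)
{
    return op == REF_OP_N ? X[(size_t)c * ld + r] : X[(size_t)r * ld + c];
}

/* C = alpha * op(A) * op(B) + beta * C, all matrices column-major.
   op(A) is m x k, op(B) is k x n, C is m x n. C is not read when beta is 0. */
void ref_dgemm(ref_op_t opA, ref_op_t opB, int m, int n, int k,
               double alpha, const double *A, int lda,
               const double *B, int ldb,
               double beta, double *C, int ldc)
{
    assert(m > 0 && n > 0 && k > 0 && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += ref_elem(A, lda, opA, i, p) * ref_elem(B, ldb, opB, p, j);
            double old = (beta == 0.0) ? 0.0 : beta * C[(size_t)j * ldc + i];
            C[(size_t)j * ldc + i] = alpha * acc + old;
        }
    }
}

/* At_buf and Bt_buf are row-major N x M buffers holding A^T and B^T.
   Read column-major with leading dimension M, they are A and B (M x N). */

/* C = A^T * B, written in column-major order. */
void atb_col_major(int N, int M, const double *At_buf, const double *Bt_buf, double *C)
{
    ref_dgemm(REF_OP_T, REF_OP_N, N, N, M, 1.0, At_buf, M, Bt_buf, M, 0.0, C, N);
}

/* C^T = B^T * A in column-major order, which is the same memory as C in row-major order. */
void atb_row_major(int N, int M, const double *At_buf, const double *Bt_buf, double *C)
{
    ref_dgemm(REF_OP_T, REF_OP_N, N, N, M, 1.0, Bt_buf, M, At_buf, M, 0.0, C, N);
}
```

With the example buffers (N = 2, M = 3), `atb_col_major` fills C with 89, 116, 98, 128 and `atb_row_major` fills it with 89, 98, 116, 128. If the convention assumption holds, the same argument order should carry over to `cublasDgemm`, with beta set to 0 so that the uninitialized contents of C are not added to the product.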