Row-major by row-major matrix multiplication in cuBLAS

I have a matrix A^T of size N*M (rows, cols) and a matrix B^T of size N*M; that is, the original A and B are both of size M*N.

What is actually laid out in memory is A^T and B^T (not A and B), and both are stored in row-major order.

I’d like to compute

        C = A^T * B,

which has dimensions N*N.

I already know that, in order to switch between row-major and column-major, I can have a matrix transposed as part of the multiplication call.

So I have been thinking about doing the following multiplication:

       C = (A^T)^T * B^T  = A * B^T

However, it still doesn’t seem to work right; I guess there is something I’m missing.
I have been trying the following call:

    int numCols = N;
    int numRows = M;
    cublasStatus = cublasDgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, numCols,
            numCols, numRows, &scale, A, numRows, B, numRows, &scale, C, numCols);

An example:

Let’s say, for instance, that I have two matrices, written in MATLAB style:

  A = [ 1 , 2 ; 3, 4 ; 5, 6 ];
  B = [ 7, 8 ; 9 , 10 ; 11 , 12];

However, what I actually have in memory is the transposed matrices:

  %   A^T = [ 1, 3, 5 ; 2,4,6]; 
  %   B^T = [ 7 ,9 , 11 ; 8 , 10 , 12 ];
  At = A'; 
  Bt = B'; 

I’d like to compute At*B, which has dimensions 2*2.

According to MATLAB, the expected result is At*B = [ 89, 98 ; 116, 128 ]. Here is my full attempt:

   #include "cuda_runtime.h"
   #include "device_launch_parameters.h"
   #include <cublas_v2.h>
   #include <stdio.h>
   
int main()
{

 const double At[6] = { 1 , 3 , 5 , 2 , 4  , 6 } ; 
 const double Bt[6] = { 7 , 9 , 11 , 8 , 10 , 12} ; 

 // Dimensions for the original matrix, the originals A, B.  

 unsigned int cols = 2; 
 unsigned int rows= 3; 

 double * p_d_At, * p_d_Bt, * p_d_C; 
 p_d_At = p_d_Bt = p_d_C = 0 ;

 cudaMalloc((void**)&p_d_At, sizeof(double)*cols*rows);
 cudaMalloc((void**)&p_d_Bt, sizeof(double)*cols*rows);
 cudaMalloc((void**)&p_d_C, sizeof(double)*cols*cols);

 cudaMemcpy(p_d_At,At,sizeof(double)*cols*rows, cudaMemcpyHostToDevice);
 cudaError_t cudaError = cudaMemcpy(p_d_Bt,Bt,sizeof(double)*cols*rows, cudaMemcpyHostToDevice);

 cublasStatus_t cublasStatus; 
 cublasHandle_t cublasHandle; 
 cublasCreate(&cublasHandle);
 const double  scale = 1.0; 
 cublasStatus = cublasDgemm(cublasHandle , CUBLAS_OP_N , CUBLAS_OP_N , cols, cols , rows, &scale, p_d_At, cols,p_d_Bt,cols, &scale,p_d_C , cols ) ; 

 double C[4]; 
 cudaMemcpy(C, p_d_C , sizeof(double)*cols*cols , cudaMemcpyDeviceToHost); 


return 0;

}
Unfortunately, the result is not correct :(

I have several questions:

1st: How do I use cuBLAS to compute the multiplication described above?

2nd: How do I use cuBLAS to get the matrix C in row-major order?

3rd: How do I use cuBLAS to get the matrix C in column-major order?

Thanks in advance

cuBLAS assumes that matrices are stored in column-major order.
Because your arrays hold A^T and B^T in row-major order, that is exactly the same memory layout as A and B in column-major order, so cuBLAS actually sees A1 = A and B1 = B.

You want to compute: C = A^T * B = A1^T * B1
=> So you should use the TN version of DGEMM:

cublasDgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, …)
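
For the concrete example in the question, a minimal sketch of the full call (assuming the cols = 2, rows = 3, cublasHandle and device pointers from the code above, and beta = 0.0 because p_d_C is uninitialized):

    // The arrays hold A^T and B^T in row-major order, i.e. A and B in
    // column-major order, so cuBLAS sees A (3x2) and B (3x2) directly.
    // This computes C = A^T * B (2x2), written to p_d_C in column-major order.
    const double alpha = 1.0;
    const double beta  = 0.0;              // p_d_C is uninitialized, so beta must be 0
    cublasStatus = cublasDgemm(cublasHandle,
                               CUBLAS_OP_T, CUBLAS_OP_N,  // op(A) = A^T, op(B) = B
                               cols, cols, rows,          // m = 2, n = 2, k = 3
                               &alpha,
                               p_d_At, rows,              // lda = 3
                               p_d_Bt, rows,              // ldb = 3
                               &beta,
                               p_d_C, cols);              // ldc = 2
    // p_d_C now holds { 89, 116, 98, 128 }, i.e. C = [ 89, 98 ; 116, 128 ] in column-major order.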

C will then be in column-major order. If you want it in row-major order, you need to transpose it separately; GEMM can only write C in column-major order.
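
If you would rather avoid the extra transpose, a common alternative (just a sketch, using a hypothetical second 2x2 device buffer p_d_Crow) is to compute C^T instead: C stored in row-major order occupies the same memory as C^T stored in column-major order, and C^T = (A^T * B)^T = B^T * A, so swapping the two input pointers makes DGEMM produce the row-major result directly:

    // Sketch: C^T = B^T * A, written column-major, which is C in row-major order.
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(cublasHandle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                cols, cols, rows,          // m = 2, n = 2, k = 3
                &alpha,
                p_d_Bt, rows,              // first operand:  B^T (2x3)
                p_d_At, rows,              // second operand: A   (3x2)
                &beta,
                p_d_Crow, cols);           // p_d_Crow = { 89, 98, 116, 128 } = C row-major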

You can use cublas<t>geam (cublasDgeam here) to transpose your matrix.
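
A rough sketch of that transpose for the 2x2 result (again using a hypothetical extra buffer p_d_Crow; with beta = 0 the second input matrix is ignored):

    // Out-of-place transpose with cublasDgeam: C_row = 1.0 * C^T + 0.0 * (ignored).
    const double one = 1.0, zero = 0.0;
    cublasDgeam(cublasHandle,
                CUBLAS_OP_T, CUBLAS_OP_N,  // op(A) = C^T; B is ignored because beta = 0
                cols, cols,                // the output is 2x2
                &one,  p_d_C,    cols,     // A = C in column-major order, lda = 2
                &zero, p_d_Crow, cols,     // beta = 0, so B's contents do not matter
                p_d_Crow, cols);           // p_d_Crow now holds C in row-major order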