I have a matrix A^T of size N*M (rows, cols) and a matrix B^T of size N*M.
Both A^T and B^T are stored in memory in transposed form,
and both are in row-major order.
I’d like to compute
C = A^T * B,
which should have dimensions N*N.
I already know that I can switch between row-major and column-major order by transposing a matrix in the multiplication call.
So I have been thinking about performing the following multiplication:
C = (A^T)^T * B^T = A * B^T
However, it still doesn’t seem to work correctly; I guess I’m missing something.
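The row-major/column-major relationship mentioned above can be checked with a few lines of plain C. This is only a sketch: the helper names `at_row_major`, `at_col_major` and `reinterpretation_is_transpose` are made up for illustration and are not part of cuBLAS. It shows that a row-major (rows x cols) buffer, read as a column-major matrix with leading dimension `cols`, is the transpose of the original.

```c
#include <assert.h>
#include <stddef.h>

/* Element (r, c) of a row-major matrix that has `cols` columns. */
static double at_row_major(const double *buf, size_t cols, size_t r, size_t c)
{
    return buf[r * cols + c];
}

/* Element (r, c) of a column-major matrix with leading dimension `ld`
 * (the BLAS/cuBLAS storage convention; ld must be >= number of rows). */
static double at_col_major(const double *buf, size_t ld, size_t r, size_t c)
{
    assert(r < ld);
    return buf[c * ld + r];
}

/* Returns 1 when the row-major (rows x cols) buffer, reinterpreted as a
 * column-major (cols x rows) matrix with ld = cols, equals the transpose. */
static int reinterpretation_is_transpose(const double *buf, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            if (at_row_major(buf, cols, r, c) != at_col_major(buf, cols, c, r))
                return 0;
    return 1;
}
```

In other words, no data has to move: the same bytes are a row-major X or a column-major X^T, and only the leading dimension passed to the library says which.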
I have been trying the following command.
int numCols = N;
int numRows = M;
cublasStatus = cublasDgemm(cublasHandle, CUBLAS_OP_T , CUBLAS_OP_N , numCols ,
Example given:
Let’s say for instance that I have two matrices, written in Matlab style:
A = [ 1 , 2 ; 3, 4 ; 5, 6 ];
B = [ 7, 8 ; 9 , 10 ; 11 , 12];
However in memory I have the matrices:
% A^T = [ 1, 3, 5 ; 2,4,6];
% B^T = [ 7 ,9 , 11 ; 8 , 10 , 12 ];
At = A';
Bt = B';
I’d like to compute At*B, which has dimensions 2*2.
According to Matlab, the result should be C = [ 89 , 98 ; 116 , 128 ].
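As a cross-check that needs neither Matlab nor a GPU, the product can be computed directly from the two row-major buffers on the CPU. The function name `reference_At_times_B` is hypothetical, and the sketch assumes both inputs are n x m row-major arrays holding A^T and B^T.

```c
#include <assert.h>
#include <stddef.h>

/* C (n x n, row-major) = At * B, where At is n x m row-major and B is
 * supplied as Bt (n x m, row-major), so B(k, j) == Bt[j * m + k]. */
static void reference_At_times_B(const double *At, const double *Bt,
                                 size_t n, size_t m, double *C)
{
    assert(At && Bt && C);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < m; ++k)
                sum += At[i * m + k] * Bt[j * m + k];
            C[i * n + j] = sum;
        }
    }
}
```

For the example data (n = 2, m = 3) this fills C with 89, 98, 116, 128 in row-major order, which is the value any GPU result can be compared against.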
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cublas_v2.h>
#include <stdio.h>
int main()
{
    const double At[6] = { 1 , 3 , 5 , 2 , 4 , 6 } ;
    const double Bt[6] = { 7 , 9 , 11 , 8 , 10 , 12} ;
    // Dimensions for the original matrix, the originals A, B.
    unsigned int cols = 2;
    unsigned int rows= 3;
    double * p_d_At, * p_d_Bt, * p_d_C;
    p_d_At = p_d_Bt = p_d_C = 0 ;
    cudaMalloc((void**)&p_d_At, sizeof(double)*cols*rows);
    cudaMalloc((void**)&p_d_Bt, sizeof(double)*cols*rows);
    cudaMalloc((void**)&p_d_C, sizeof(double)*cols*cols);
    cudaMemcpy(p_d_At,At,sizeof(double)*cols*rows, cudaMemcpyHostToDevice);
    cudaError_t cudaError = cudaMemcpy(p_d_Bt,Bt,sizeof(double)*cols*rows, cudaMemcpyHostToDevice);
    cublasStatus_t cublasStatus;
    cublasHandle_t cublasHandle;
    const double scale = 1.0;
    cublasStatus = cublasDgemm(cublasHandle , CUBLAS_OP_N , CUBLAS_OP_N , cols, cols , rows, &scale, p_d_At, cols,p_d_Bt,cols, &scale,p_d_C , cols ) ;
    double C[4];
    cudaMemcpy(C, p_d_C , sizeof(double)*cols*cols , cudaMemcpyDeviceToHost);
    return 0;
}
Unfortunately the result is not correct :(
I have several questions:
1st, how do I use cuBLAS to compute the multiplication above?
2nd, how do I use cuBLAS to get matrix C in row-major order?
3rd, how do I use cuBLAS to get matrix C in column-major order?
Thanks in advance
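One possible reading of the three questions, sketched on the CPU so it can be run without a GPU. cuBLAS treats every buffer as column-major, so the row-major At buffer with leading dimension `rows` is A, and the Bt buffer is B. Asking for CUBLAS_OP_T on the first operand then gives A^T * B. The names `op_t`, `gemm_colmajor`, `product_col_major` and `product_row_major` are stand-ins invented for this sketch. `gemm_colmajor` only mimics the documented GEMM convention C = alpha * op(A) * op(B) + beta * C and is not a cuBLAS function.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_N, OP_T } op_t; /* stand-ins for CUBLAS_OP_N / CUBLAS_OP_T */

/* Element (r, c) of op(X), where X is stored column-major with stride ld. */
static double op_elem(op_t op, const double *X, size_t ld, size_t r, size_t c)
{
    return (op == OP_N) ? X[c * ld + r] : X[r * ld + c];
}

/* CPU stand-in for the GEMM convention: C = alpha*op(A)*op(B) + beta*C,
 * with op(A) m x k, op(B) k x n, C m x n, all buffers column-major. */
static void gemm_colmajor(op_t opA, op_t opB, size_t m, size_t n, size_t k,
                          double alpha, const double *A, size_t lda,
                          const double *B, size_t ldb,
                          double beta, double *C, size_t ldc)
{
    assert(ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += op_elem(opA, A, lda, i, p) * op_elem(opB, B, ldb, p, j);
            /* With beta == 0 the old contents of C are never read. */
            double old = (beta == 0.0) ? 0.0 : beta * C[j * ldc + i];
            C[j * ldc + i] = alpha * sum + old;
        }
    }
}

/* Questions 1 and 3: C = A^T * B, result stored column-major. */
static void product_col_major(const double *At, const double *Bt,
                              size_t rows, size_t cols, double *C)
{
    gemm_colmajor(OP_T, OP_N, cols, cols, rows,
                  1.0, At, rows, Bt, rows, 0.0, C, cols);
}

/* Question 2: swap the operands to get C^T = B^T * A column-major,
 * which is the same memory as C in row-major order. */
static void product_row_major(const double *At, const double *Bt,
                              size_t rows, size_t cols, double *C)
{
    gemm_colmajor(OP_T, OP_N, cols, cols, rows,
                  1.0, Bt, rows, At, rows, 0.0, C, cols);
}
```

With the example buffers, `product_col_major` leaves 89, 116, 98, 128 in C and `product_row_major` leaves 89, 98, 116, 128. Assuming the standard `cublasDgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)` signature, the column-major case would correspond to `cublasDgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, cols, cols, rows, &alpha, p_d_At, rows, p_d_Bt, rows, &beta, p_d_C, cols)`, and the row-major case to the same call with `p_d_Bt` and `p_d_At` swapped. Three things in the posted snippet probably also matter: the handle is never initialised with `cublasCreate`, `ldb = cols` is smaller than k = rows for a non-transposed B, and passing `scale = 1.0` as beta adds the product to the uninitialised contents of `p_d_C`, so beta should most likely be 0.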