Can CuBLAS do a simple transpose?

I have a 2x3 matrix stored in row-major order (it comes from C). I want to use CuBLAS to transpose it into a 3x2 matrix. I tried:

float alpha = 1.0f;
float beta = 0.0f;
// transpose(da) -> dat, C=2, T=3
cublasSgeam(handle,
            CUBLAS_OP_T, CUBLAS_OP_T,
            C, T,
            &alpha, da, T,
            &beta, da, T,
            dat, C
    );

But the result I get is a weird strided output:

Original matrix:
[0.849739, 0.989397, 0.288401;
0.46367, 0.471273, 0.158544]
Transposed matrix:
[0.849739, 0.288401;
0.471273, 0.989397;
0.46367, 0.158544]

Pretty sure that the column-major input for CuBLAS is causing this but I can’t pinpoint what’s happening. Any help would be appreciated!

On a higher level, I want to use CuBLAS to do matrix transpose but now I’m unsure if that’s even possible/intended.

  • CUBLAS expects column-major data storage.
  • the geam function is the usual one suggested for just a transpose
  • it's often recommended to avoid transposing or moving data unnecessarily
  • if you want to handle row-major input in CUBLAS, it can be done in some cases, but it requires special manipulation of parameters. Here is a recent related question. Note the link in the comments to the previous question with the excerpted treatment by Mr. Wittek.
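To make the storage point above concrete, here is a small host-only sketch (no CUBLAS involved). It shows that a row-major rows x cols buffer, read as a column-major cols x rows matrix with leading dimension cols, is exactly the transpose, which is why the parameters need special care. The helper names (at_row_major, at_col_major, row_major_is_col_major_transpose) are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (i, j) of a row-major matrix that has `cols` columns.
inline float at_row_major(const std::vector<float>& buf, int cols, int i, int j) {
    return buf[static_cast<std::size_t>(i) * cols + j];
}

// Element (i, j) of a column-major matrix with leading dimension `ld`.
inline float at_col_major(const std::vector<float>& buf, int ld, int i, int j) {
    return buf[static_cast<std::size_t>(j) * ld + i];
}

// Checks that the row-major (rows x cols) buffer, reinterpreted as a
// column-major (cols x rows) matrix with ld = cols, holds the transpose:
// row-major element (i, j) must equal column-major element (j, i).
bool row_major_is_col_major_transpose(const std::vector<float>& buf, int rows, int cols) {
    assert(buf.size() == static_cast<std::size_t>(rows) * cols);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            if (at_row_major(buf, cols, i, j) != at_col_major(buf, cols, j, i)) {
                return false;
            }
        }
    }
    return true;
}
```

For the 2x3 matrix in the question, row_major_is_col_major_transpose(buf, 2, 3) holds, so from the library's point of view the buffer already looks like a 3x2 column-major matrix with leading dimension 3.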

If I am not mistaken, this arrangement seems to work for your test case:

# cat t303.cu
#include <cstring>   // memcpy
#include <iostream>
#include <cublas_v2.h>

int main(){

  float *dat, *da;
  const int R = 2;
  const int C = 3;
  int T = R;
  cudaMallocManaged(&dat, R*C*sizeof(dat[0]));
  cudaMallocManaged(&da,  R*C*sizeof(dat[0]));
  float di[R*C] = {0.849739, 0.989397, 0.288401, 0.46367, 0.471273, 0.158544};
  float alpha = 1.0f;
  float beta = 0.0f;
  cublasHandle_t handle;
  cublasCreate(&handle);
  memcpy(da, di, R*C*sizeof(da[0]));
// transpose(da) -> dat
  cublasSgeam(handle,
                CUBLAS_OP_T, CUBLAS_OP_T,
                T, C,
                &alpha, da, C,
                &beta, da, C,
                dat, T
    );
  cudaDeviceSynchronize();
  for (int i = 0; i < R*C; i++) std::cout << dat[i] << " ";
  std::cout << std::endl;
}
# nvcc -o t303 t303.cu -lcublas
# compute-sanitizer ./t303
========= COMPUTE-SANITIZER
0.849739 0.46367 0.989397 0.471273 0.288401 0.158544
========= ERROR SUMMARY: 0 errors
#

FWIW, the arguments I pass are numerically identical to yours (m=2, n=3, lda=ldb=3, ldc=2), so whatever problem you were having is not evident from what you have posted. The geam call you have shown does not produce the transposed matrix you have shown. Anyway, I think it works. (You would get the output you indicated if you set C=3, T=2, but that contradicts the comment before your call.)
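If you want to check the two parameter sets without a GPU, the indexing can be sketched on the host. The function below is a hypothetical CPU stand-in (geam_tt_reference is my name, not a CUBLAS function) for the documented formula C = alpha*op(A) + beta*op(B) with both ops set to transpose and column-major storage. It is only a sketch of the arithmetic, not how the library implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for cublasSgeam with transa = transb = CUBLAS_OP_T.
// Returns the m x n column-major result C = alpha * A^T + beta * B^T, where
// A and B are n x m column-major with leading dimensions lda and ldb, and
// C is stored with leading dimension ldc.
std::vector<float> geam_tt_reference(int m, int n,
                                     float alpha, const std::vector<float>& A, int lda,
                                     float beta, const std::vector<float>& B, int ldb,
                                     int ldc) {
    assert(lda >= n && ldb >= n && ldc >= m);
    assert(A.size() >= static_cast<std::size_t>(lda) * m);
    assert(B.size() >= static_cast<std::size_t>(ldb) * m);
    std::vector<float> C(static_cast<std::size_t>(ldc) * n, 0.0f);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            // op(A)(i, j) = A(j, i), which sits at A[j + i * lda] in column-major order.
            const std::size_t src_a = static_cast<std::size_t>(i) * lda + j;
            const std::size_t src_b = static_cast<std::size_t>(i) * ldb + j;
            C[static_cast<std::size_t>(j) * ldc + i] = alpha * A[src_a] + beta * B[src_b];
        }
    }
    return C;
}
```

Calling it on the six input values with m=2, n=3, lda=ldb=3, ldc=2 gives the buffer my program printed; calling it with m=3, n=2, lda=ldb=2, ldc=3 gives the strided result from the question.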
