The geam function (cublasSgeam and its typed variants) is the one usually suggested when all you need is a transpose.
It's often recommended to avoid transposing or moving data unnecessarily.
cuBLAS assumes column-major storage. If you want to handle row-major input in cuBLAS, it can be done in some cases, but it requires special manipulation of the parameters (the operation flags, the dimensions, and the leading dimensions). Here is a recent related question. Note the link in the comments to the previous question with the excerpted treatment by Mr. Wittek.
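One commonly used manipulation of this kind can be sketched on the CPU, so the index arithmetic can be checked without a GPU. A row-major m x k buffer is, element for element, a column-major k x m matrix (its transpose). A row-major product C = A*B can therefore be obtained from a column-major routine by swapping the operands and the m/n sizes, which computes C^T = B^T * A^T. The helpers gemm_colmajor and gemm_rowmajor below are hypothetical names used for illustration, not cuBLAS functions, and this is an assumption about the kind of parameter manipulation meant above rather than a restatement of the linked treatment.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for a column-major GEMM without transposes: C = A * B,
// where A is m x k, B is k x n and C is m x n, all stored column-major.
// Hypothetical helper; the argument order loosely follows cublasSgemm
// with alpha = 1 and beta = 0.
void gemm_colmajor(int m, int n, int k,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = acc;
        }
    }
}

// Row-major C (m x n) = A (m x k) * B (k x n), using only the column-major
// routine. The row-major buffers are reinterpreted as their column-major
// transposes, so the call computes C^T = B^T * A^T, which lands in memory
// exactly as row-major C.
std::vector<float> gemm_rowmajor(int m, int n, int k,
                                 const std::vector<float>& A,
                                 const std::vector<float>& B) {
    std::vector<float> C(m * n);
    // Operands swapped, m and n swapped; leading dimensions are the
    // row lengths of the row-major buffers.
    gemm_colmajor(n, m, k, B.data(), n, A.data(), k, C.data(), n);
    return C;
}
```

For example, with the row-major 2x3 matrix A = {1, 2, 3, 4, 5, 6} and the row-major 3x2 matrix B = {7, 8, 9, 10, 11, 12}, gemm_rowmajor(2, 2, 3, A, B) yields the row-major 2x2 result {58, 64, 139, 154}.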
If I am not mistaken, this arrangement seems to work for your test case:
# cat t303.cu
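#include <cstring>      // memcpy
#include <iostream>
#include <cublas_v2.h>

int main(){
  float *dat, *da;
  const int R = 2;      // rows of the row-major input
  const int C = 3;      // columns of the row-major input
  int T = R;            // leading dimension of the output
  // managed memory is accessible from both host and device
  cudaMallocManaged(&dat, R*C*sizeof(dat[0]));
  cudaMallocManaged(&da,  R*C*sizeof(da[0]));
  float di[R*C] = {0.849739, 0.989397, 0.288401, 0.46367, 0.471273, 0.158544};
  float alpha = 1.0f;
  float beta  = 0.0f;
  cublasHandle_t handle;
  cublasCreate(&handle);
  memcpy(da, di, R*C*sizeof(da[0]));
  // transpose(da) -> dat
  // cuBLAS sees the row-major RxC buffer da as a column-major CxR matrix (lda = C);
  // op(A) = transpose(A) is then TxC and is written to dat with ldc = T.
  // beta is 0, so the second matrix argument does not contribute to the result.
  cublasSgeam(handle,
              CUBLAS_OP_T, CUBLAS_OP_T,
              T, C,
              &alpha, da, C,
              &beta,  da, C,
              dat, T
              );
  cudaDeviceSynchronize();  // wait for the GPU before reading dat on the host
  for (int i = 0; i < R*C; i++) std::cout << dat[i] << " ";
  std::cout << std::endl;
  cublasDestroy(handle);
  cudaFree(da);
  cudaFree(dat);
}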
# nvcc -o t303 t303.cu -lcublas
# compute-sanitizer ./t303
========= COMPUTE-SANITIZER
0.849739 0.46367 0.989397 0.471273 0.288401 0.158544
========= ERROR SUMMARY: 0 errors
#
FWIW I note that my presentation of arguments is exactly the same as yours. So whatever problem you were having is not evident from what you have posted/shown. The geam call you have indicated does not result in the transposed matrix you indicated. Anyway, I think it works. (You would get the output you indicated if you set C=3, T=2, but that is contrary to what you have indicated in the comment before the call.)
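To check the indexing of the geam call above without a GPU, here is a minimal CPU sketch of what the call computes when transa is CUBLAS_OP_T, alpha is 1 and beta is 0. The names geam_transpose_ref and transpose_rowmajor are hypothetical helpers for illustration, not part of cuBLAS; the sketch only mirrors the column-major index arithmetic and says nothing about performance.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for cublasSgeam with transa = CUBLAS_OP_T, alpha = 1, beta = 0:
// C (m x n, column-major, leading dimension ldc) = transpose(A),
// where A is stored column-major as n x m with leading dimension lda.
// Hypothetical helper, intended only for checking index arithmetic.
void geam_transpose_ref(int m, int n,
                        const float* A, int lda,
                        float* C, int ldc) {
    assert(lda >= n && ldc >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + j * ldc] = A[j + i * lda];
}

// Transpose a row-major rows x cols buffer into a row-major cols x rows
// buffer, passing the same sizes as the cuBLAS call in the listing:
// m = rows, n = cols, lda = cols, ldc = rows.
std::vector<float> transpose_rowmajor(int rows, int cols,
                                      const std::vector<float>& in) {
    std::vector<float> out(rows * cols);
    geam_transpose_ref(rows, cols, in.data(), cols, out.data(), rows);
    return out;
}
```

With the row-major 2x3 input {1, 2, 3, 4, 5, 6}, transpose_rowmajor(2, 3, in) gives {1, 4, 2, 5, 3, 6}, which is the same element ordering as the program output above (first element of row 0, first element of row 1, second element of row 0, and so on).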