The geam function is the one usually suggested when all you need is a transpose.
It's often recommended to avoid transposing or moving data unnecessarily.
If you want to handle row-major input in CUBLAS, it can be done in some cases, but it requires special manipulation of the parameters. Here is a recent related question. Note the link in the comments to the previous question with the excerpted treatment by Mr. Wittek.
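As I understand it, the parameter manipulation rests on one observation: a row-major `rows x cols` buffer occupies memory exactly like a column-major `cols x rows` buffer whose leading dimension is `cols`. The following is only a host-side sketch of that index arithmetic. It does not call CUBLAS, and the helper names (`row_major_offset`, `col_major_offset`, `same_layout`) are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers, for illustration only (not part of any library).

// Offset of element (r, c) in a row-major matrix that has `cols` columns.
inline std::size_t row_major_offset(int r, int c, int cols) {
    assert(r >= 0 && c >= 0 && c < cols);
    return static_cast<std::size_t>(r) * cols + c;
}

// Offset of element (r, c) in a column-major matrix with leading dimension `ld`.
inline std::size_t col_major_offset(int r, int c, int ld) {
    assert(r >= 0 && c >= 0 && r < ld);
    return static_cast<std::size_t>(c) * ld + r;
}

// Returns true if every element (r, c) of a row-major rows x cols matrix sits
// at the same offset as element (c, r) of a column-major cols x rows matrix
// with leading dimension cols.
inline bool same_layout(int rows, int cols) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (row_major_offset(r, c, cols) != col_major_offset(c, r, cols))
                return false;
    return true;
}
```

So handing a row-major buffer to a column-major routine amounts to handing it the transposed matrix, which is why the dimensions and leading dimensions have to be swapped or adjusted rather than the data moved.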
If I am not mistaken, this arrangement seems to work for your test case:
# cat
#include <iostream>
#include <cstring>
#include <cublas_v2.h>
int main(){
  float *dat, *da;
  const int R = 2;
  const int C = 3;
  int T = R;
  cudaMallocManaged(&dat, R*C*sizeof(dat[0]));
  cudaMallocManaged(&da, R*C*sizeof(dat[0]));
  float di[R*C] = {0.849739, 0.989397, 0.288401, 0.46367, 0.471273, 0.158544};
  float alpha = 1.0f;
  float beta = 0.0f;
  cublasHandle_t handle;
  cublasCreate(&handle);
  memcpy(da, di, R*C*sizeof(da[0]));
  // transpose(da) -> dat
  // m = T, n = C; lda = C, ldc = T; B is not read because beta = 0
  cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
              T, C,
              &alpha, da, C,
              &beta, da, C,
              dat, T
              );
  // wait for the GPU to finish before reading managed memory on the host
  cudaDeviceSynchronize();
  for (int i = 0; i < R*C; i++) std::cout << dat[i] << " ";
  std::cout << std::endl;
}
# nvcc -o t303 -lcublas
# compute-sanitizer ./t303
0.849739 0.46367 0.989397 0.471273 0.288401 0.158544
========= ERROR SUMMARY: 0 errors
FWIW I note that my presentation of arguments is exactly the same as yours. So whatever problem you were having is not evident from what you have posted/shown. The geam call you have indicated does not result in the transposed matrix you indicated. Anyway, I think it works. (You would get the output you indicated if you set C=3, T=2, but that is contrary to what you have indicated in the comment before the call.)
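If it helps to check the indexing without a GPU, here is a small CPU reference for what I believe geam computes when the first operand is transposed and beta is zero, using CUBLAS's column-major convention. This is a sketch, not CUBLAS code: `geam_transpose_ref` is a hypothetical name, and it only models the `alpha * transpose(A)` term.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not a CUBLAS API).
// Computes out = alpha * transpose(A) in column-major storage:
//   out is m x n with leading dimension ldc (ldc >= m)
//   A   is n x m with leading dimension lda (lda >= n)
std::vector<float> geam_transpose_ref(int m, int n, float alpha,
                                      const std::vector<float>& A, int lda,
                                      int ldc) {
    assert(m > 0 && n > 0 && lda >= n && ldc >= m);
    assert(A.size() >= static_cast<std::size_t>(m - 1) * lda + n);
    std::vector<float> out(static_cast<std::size_t>(ldc) * n, 0.0f);
    for (int j = 0; j < n; ++j)       // column of out
        for (int i = 0; i < m; ++i)   // row of out
            out[i + j * ldc] = alpha * A[j + i * lda];  // out(i, j) = A(j, i)
    return out;
}
```

Calling `geam_transpose_ref(2, 3, 1.0f, a, 3, 2)` with the six values from the test case in `a` (that is, m = T = 2, n = C = 3, lda = C, ldc = T) gives 0.849739 0.46367 0.989397 0.471273 0.288401 0.158544, which is the same ordering as the output shown above.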