Help in using CUBLAS

I am baffled by the fact that CUBLAS is column-major, and I can't use it effectively. (I am using it from inside a C++ application.)
The problem is rather simple.
I have two arrays, U_phi_transposed and data_gpu. The U_phi_transposed array is obtained by transposing U_phi in an earlier step of the algorithm, and that transpose is mandatory because I use the result in later calls.
The dimensions are:
data_gpu: NDATA*NDIM
U_phi: NDATA*NCLASS
U_phi_transposed: NCLASS*NDATA

And I want to have an array c1=U_phi_transposed*data_gpu.

Can anyone please tell me how to use CUBLAS with the least number of transposes?
I have looked into the MatMul example from the SDK, but it only works for square matrices (or I haven't managed to make it work for rectangular ones).
Thank you for your time.

Check out www.netlib.org, BLAS section.
It has a C interface document for BLAS.
It is simply superb.
Most of the row-major routines can be derived from the col-major routines just by manipulating the parameters. No transpose is actually needed.

For example, consider the GEMM operation C += A*B.
Now if I pass the matrices in row-major order, CUBLAS would actually see C^T, A^T and B^T.
The operation C += A*B can also be achieved by C^T += (A*B)^T, which is equal to C^T += B^T*A^T.
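To make the "CUBLAS sees the transpose" point concrete, here is a small CPU-only sketch. The helper names (row_major_at, col_major_at, check_transpose_view) are made up for illustration and are not part of BLAS or CUBLAS; the only assumption is the usual column-major indexing rule, element (i, j) at offset i + j*ld.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (i, j) of a row-major matrix that has `cols` columns.
inline float row_major_at(const std::vector<float>& buf, std::size_t cols,
                          std::size_t i, std::size_t j) {
    return buf[i * cols + j];
}

// Element (i, j) of a column-major matrix with leading dimension `ld`
// (its number of rows). This is the indexing rule BLAS-style routines use.
inline float col_major_at(const std::vector<float>& buf, std::size_t ld,
                          std::size_t i, std::size_t j) {
    return buf[i + j * ld];
}

// Checks that a row-major rows x cols buffer, read as a column-major
// cols x rows matrix with ld = cols, is exactly the transpose.
inline bool check_transpose_view(const std::vector<float>& buf,
                                 std::size_t rows, std::size_t cols) {
    assert(buf.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            if (row_major_at(buf, cols, i, j) != col_major_at(buf, cols, j, i))
                return false;
    return true;
}
```

So the same buffer needs no copying at all: only the interpretation of the dimensions and the leading dimension changes.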

Thus, you just need to swap the A and B parameters, call the GEMM NN variant, and adjust the M, N and K parameters (M and N trade places) to get the job done.
You really don't need to transpose the matrices.
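A rough sketch of the parameter swap, applied to the sizes from the question. colmajor_gemm_nn is a hypothetical CPU stand-in for a column-major GEMM "NN" call, not a CUBLAS function, and it overwrites C (beta = 0) instead of accumulating. rowmajor_matmul and demo_c1 are also illustrative names, and the tiny sizes are made up.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU stand-in for a column-major GEMM "NN":
// C (m x n) = A (m x k) * B (k x n), element (i, j) stored at i + j*ld.
void colmajor_gemm_nn(int m, int n, int k,
                      const float* A, int lda,
                      const float* B, int ldb,
                      float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = acc;
        }
}

// Row-major C (rows x cols) = A (rows x inner) * B (inner x cols).
// No transposes: swap the operands and pass m = cols, n = rows.
void rowmajor_matmul(int rows, int cols, int inner,
                     const float* A, const float* B, float* C) {
    colmajor_gemm_nn(cols, rows, inner,
                     B, cols,    // seen column-major as B^T (cols x inner)
                     A, inner,   // seen column-major as A^T (inner x rows)
                     C, cols);   // written column-major as C^T (cols x rows)
}

// c1 = U_phi_transposed * data_gpu with made-up sizes
// NCLASS = 2, NDATA = 3, NDIM = 2, everything row-major.
std::vector<float> demo_c1() {
    const int NCLASS = 2, NDATA = 3, NDIM = 2;
    std::vector<float> U_phi_transposed = {1, 2, 3,
                                           4, 5, 6};  // NCLASS x NDATA
    std::vector<float> data_gpu = {1, 0,
                                   0, 1,
                                   1, 1};             // NDATA x NDIM
    std::vector<float> c1(NCLASS * NDIM, 0.0f);       // NCLASS x NDIM
    assert(U_phi_transposed.size() == static_cast<std::size_t>(NCLASS * NDATA));
    rowmajor_matmul(NCLASS, NDIM, NDATA,
                    U_phi_transposed.data(), data_gpu.data(), c1.data());
    return c1;
}
```

Assuming single precision and the char-based cublasSgemm interface, the matching call for device pointers would presumably be cublasSgemm('N', 'N', NDIM, NCLASS, NDATA, 1.0f, data_gpu, NDIM, U_phi_transposed, NDATA, 0.0f, c1, NDIM). Every leading dimension is simply the row length of the corresponding row-major array.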

Most BLAS routines can be handled this way (except possibly those involving conjugates and Hermitian matrices, for which you may need to conjugate your matrix before calling the col-major routines).

Check out

OK, I found this before posting back:
http://forums.developer.nvidia.com/devforum/discussion/3946/cublas-and-column-major-storage/p1
I tried incorporating it into my program, but it fails.
I am attaching my file so that someone may spot the mistake; the SDK has the same code.
Maybe I am doing something wrong here, but I can't find it.

Thanks again

mat_mul2.cu (7.1 KB)