I am baffled by the fact that CUBLAS is column-major, and I can't use it effectively. (I am using it from inside a C++ application.)
The problem is rather simple.
I have two arrays, U_phi_transposed and data_gpu. The U_phi_transposed array comes from transposing U_phi in a previous step of the algorithm; that transpose is mandatory because I use it in later calls.
The dimensions are:
data_gpu = NDATA*NDIM
U_phi = NDATA*NCLASS
U_phi_transposed = NCLASS*NDATA
And I want to compute an array c1 = U_phi_transposed*data_gpu.
Can anyone please tell me how to use CUBLAS with the least amount of transposes?
I have looked into the MatMul example from the SDK, but it only works for square matrices (or I haven't managed to make it work for rectangular ones).
Thank you for your time
Check out www.netlib.org - BLAS section.
It has a C interface document for BLAS.
It is simply superb.
Most of the row-major routines can be derived from col-major routines just by manipulating the parameters; no transpose is actually needed.
For example, consider GEMM operation C += AB
Now if I pass the matrices in row-major, CUBLAS would actually see C^T, A^T and B^T.
The operation C += AB can also be achieved as C^T += (AB)^T, which is equal to C^T += B^T A^T.
Thus, you just need to swap the A and B parameters, call the GEMM NN variant, adjust the M, N, K parameters accordingly, and the job is done.
You really don't need to transpose the matrices…
Most BLAS routines can be handled this way (except possibly for conjugates and Hermitians, for which you may need to conjugate your matrix before calling the col-major routines).