I am now searching for a way to calculate a matrix division in a mex file: “C=A/B” Google searching finds only hints and promises of such code; this is of course implemented in LAPACK, but that’s not available for CUDA yet.
Right now I am using the mexCallMATLAB function to fall back on MATLAB's division. The division itself is not that computationally intensive, but having to copy the matrices off the GPU and back again just to do it is annoying and time consuming.
Is code for calculating this division available now, or is it still in the works?
Is there a CUDA routine available for matrix division X = A/B, i.e., the solution to X*B = A? Or can someone comment on the status of such code?
I have run across a few recent research papers along these lines; perhaps it is a little too early to ask for such code, although I am sure someone out there has taken the trouble to implement this.
This is related to the LAPACK routines sgetrf (LU factorization, built on level-3 BLAS) and sgetrs (the corresponding triangular solves), so I suspect that something like this is in the works. (In the CUDA 2.0 final release?)
I think I’d rather wait for the tool to become available than attempt to implement it myself from the BLAS/LAPACK routines…
I have tried to implement this by means of Gaussian elimination with partial pivoting, but was getting inaccurate results. I plan on trying full pivoting, but for the moment I have also resorted to doing the division in MATLAB, since I have other kernels that take 15 times as long as the division does.
I believe MATLAB computes this using LU decomposition, so maybe you can find some LU code for CUDA and use the factors to do the division. (If you find some LU code, let me know.)
Do you know of an efficient way of inverting lower or upper triangular matrices? To my knowledge this can be vectorized with some tricks, but I know of no way to parallelize it that would make a GPU version any faster than the CPU one.