matlab, mex files and matrix division?

I’ve gotten my feet wet with CUDA, and am working on wetting my ankles now. Recall the mex file routine for using CUDA for a generalized matrix multiplication, SGEMM, from matlab:
http://forums.nvidia.com/index.php?showtopic=69566

I am now searching for a way to compute a matrix division, "C = A/B", in a mex file. Google searching turns up only hints and promises of such code; this is of course implemented in LAPACK, but LAPACK is not available for CUDA yet.

Right now I am using the mexCallMATLAB function to fall back on matlab's own division. The division itself is not that intensive, but having to copy the matrices out of the GPU and back in just to do the division is a bit annoying/time consuming.

Is code for calculating this division available now, or is it still in the works?

Thanks

I will take the liberty of bumping this inquiry…

Is there a CUDA routine available for matrix division: X = A/B, i.e., the X satisfying X*B = A? Or can someone comment on the status of such code?

I have run across a few recent research papers along these lines; perhaps it is a little too early to ask for such code, although I am sure someone out there has taken the trouble to implement this.

This is related to, e.g., the LAPACK routines sgetrf (LU factorization, built on level 3 BLAS) and sgetrs (the subsequent triangular solves), so I suspect that something like this is in the works. (In the CUDA 2.0 final release?)

I think I’d rather wait for the tool to become available than attempt to implement it myself from the BLAS/LAPACK routines…

I would also need a matrix division for my project. Is there already any CUDA code or CUBLAS library routine available for this?

Please at least give us some info on this.

I have tried to implement this by means of Gaussian elimination with partial pivoting, but was getting inaccurate results. I plan to try full pivoting, but for the moment I have also fallen back to doing the division in matlab, since I have other kernels taking 15 times as long as the division does in matlab.

I believe matlab computes this using LU decomposition, so maybe you can find some LU code for CUDA and use the resulting factors to do the division. (If you find some LU code, let me know.)

Do you know of an efficient way of inverting lower or upper triangular matrices? To my knowledge, while this is vectorizable with some tricks, I know of no way to parallelize it such that the GPU version comes out any faster than the CPU one.

I would check the ScaLAPACK sources. They have parallelized these kinds of algorithms there (although I have a hard time understanding the code).

Thanks, I will have a look. Though I forgot the critical point: my case involves sparse matrices; for dense matrices it should work OK.