Matrix inverse of a 6 x 6 matrix in each thread

I’m implementing an algorithm that requires the matrix inverse of a 6 x 6 matrix (or larger) in each thread. Any ideas on how to implement this? The main problem, as I see it, is that it requires a lot of registers to store both the original matrix and the matrix inverse. As far as I know, it is only possible to use 64 registers per thread with the Fermi GPUs. The local memory (= L1 cache?) will be used for the rest of the registers?

I know that the matrix is symmetric, this reduces the number of registers and perhaps makes it easier to calculate the inverse?

I also need to calculate the matrix square root of the inverse (i.e. not elementwise square root).

Got the main algorithm sorted out,

first I calculate the Cholesky decomposition of the matrix A, such that I get a lower triangle L. Then I perform forward substitution on this matrix to get the inverse Linv and finally calculate the inverse of the original matrix A by multiplying L it with its transpose, i.e. Ainv = Linv’*Linv.

For the matrix square root I can use the Babylonian approach, which uses one matrix inverse in each iteration.

Time for implementation…

I don’t have any personal experience with computing the matrix square root, but the following paper might be helpful:
Nicholas J. Higham
Newton’s Method for the Matrix Square Root
Mathematics of Computation, Vol. 46, No. 174, April 1986, pp. 537-549