I’m implementing an algorithm that requires the inverse of a 6 x 6 (or larger) matrix in each thread. Any ideas on how to implement this? The main problem, as I see it, is that it takes a lot of registers to store both the original matrix and its inverse. As far as I know, Fermi GPUs allow at most 63 registers per thread. Will the values that don't fit in registers be spilled to local memory (which is cached in L1, if I understand correctly)?

I know that the matrix is symmetric. This reduces the number of values to store (21 unique entries instead of 36 for a 6 x 6 matrix), and perhaps it also makes the inverse easier to calculate?
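
A minimal sketch of one way to exploit the symmetry, assuming the matrix is also positive definite (that is not stated above and would need checking): store only the lower triangle in a packed array and invert in place via a Cholesky factorization. The names `pidx`, `spd_inverse_packed` and the `HD` macro are made up for illustration; the code compiles as plain C++ and, under nvcc, as a device function.

```cpp
#include <math.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Index of element (i, j), with i >= j, in row-packed lower-triangular storage.
HD inline int pidx(int i, int j) { return i * (i + 1) / 2 + j; }

// In-place inverse of a symmetric positive definite N x N matrix whose lower
// triangle is stored in a[0 .. N*(N+1)/2 - 1]. Returns false if a pivot is not
// positive (matrix not positive definite); the contents of a are then undefined.
template <int N, typename T>
HD bool spd_inverse_packed(T* a) {
    // Step 1: Cholesky factorization A = L * L^T, L overwrites a.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j <= i; ++j) {
            T s = a[pidx(i, j)];
            for (int k = 0; k < j; ++k) s -= a[pidx(i, k)] * a[pidx(j, k)];
            if (i == j) {
                if (s <= T(0)) return false;
                a[pidx(i, i)] = sqrt(s);
            } else {
                a[pidx(i, j)] = s / a[pidx(j, j)];
            }
        }
    }
    // Step 2: invert the triangular factor in place, a now holds M = L^{-1}.
    for (int i = 0; i < N; ++i) {
        T d = T(1) / a[pidx(i, i)];
        for (int j = 0; j < i; ++j) {
            T s = 0;
            for (int k = j; k < i; ++k) s += a[pidx(i, k)] * a[pidx(k, j)];
            a[pidx(i, j)] = -s * d;
        }
        a[pidx(i, i)] = d;
    }
    // Step 3: A^{-1} = M^T * M, computed in place (lower triangle only).
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j <= i; ++j) {
            T s = 0;
            for (int k = i; k < N; ++k) s += a[pidx(k, i)] * a[pidx(k, j)];
            a[pidx(i, j)] = s;
        }
    }
    return true;
}
```

For a 6 x 6 matrix this needs only the 21 packed values plus a few scalars, e.g. `spd_inverse_packed<6>(a)`. As far as I understand, the compiler can only keep `a` in registers if all loops are fully unrolled so that every index is a compile-time constant; otherwise the array ends up in local memory.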

I also need to calculate the matrix square root of the inverse (i.e. not the elementwise square root).
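
One candidate for the square root of the inverse is a coupled Newton-Schulz iteration, which uses only matrix multiplications. The sketch below again assumes a symmetric positive definite matrix and scales it by its Frobenius norm so that the iteration converges; `matmul`, `inv_sqrt_newton_schulz` and the `HD` macro are illustrative names, the storage is dense rather than packed to keep the code short, and the number of iterations is left to the caller because it depends on the conditioning of the matrix.

```cpp
#include <math.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// c = a * b for dense row-major N x N matrices (c must not alias a or b).
template <int N, typename T>
HD void matmul(const T* a, const T* b, T* c) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            T s = 0;
            for (int k = 0; k < N; ++k) s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
    }
}

// Writes an approximation of A^{-1/2} into z for a symmetric positive
// definite N x N matrix a (dense, row-major). Assumes a is nonzero.
template <int N, typename T>
HD void inv_sqrt_newton_schulz(const T* a, T* z, int iters) {
    T y[N * N], t[N * N], tmp[N * N];

    // Scale so that all eigenvalues of Y0 = A / c lie in (0, 1].
    T c = 0;
    for (int i = 0; i < N * N; ++i) c += a[i] * a[i];
    c = sqrt(c);

    for (int i = 0; i < N * N; ++i) { y[i] = a[i] / c; z[i] = 0; }
    for (int i = 0; i < N; ++i) z[i * N + i] = 1;

    for (int it = 0; it < iters; ++it) {
        // t = (3I - Z*Y) / 2
        matmul<N>(z, y, t);
        for (int i = 0; i < N * N; ++i) t[i] = -T(0.5) * t[i];
        for (int i = 0; i < N; ++i) t[i * N + i] += T(1.5);
        // Y <- Y * t  (tends to (A/c)^{1/2})
        matmul<N>(y, t, tmp);
        for (int i = 0; i < N * N; ++i) y[i] = tmp[i];
        // Z <- t * Z  (tends to (A/c)^{-1/2})
        matmul<N>(t, z, tmp);
        for (int i = 0; i < N * N; ++i) z[i] = tmp[i];
    }

    // Undo the scaling: A^{-1/2} = (A/c)^{-1/2} / sqrt(c).
    T s = T(1) / sqrt(c);
    for (int i = 0; i < N * N; ++i) z[i] *= s;
}
```

Note that this version keeps four dense N x N arrays per thread (144 values for N = 6), so it would certainly not fit in registers as written; it is meant to show the iteration, not a tuned kernel.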