From what I can tell, Gauss-Jordan elimination does not break down very well to SIMD architectures and the GPU specifically.

Can anyone suggest a way to attack matrix inversion when dealing with very large square matrices, think 100x100 matrices and even higher, specifically with GPU in mind?

You should be able to implement Gauss-Jordan on a GPU and see speedup. You’ll just have to launch your kernel(s) more than once, in order for threadblocks to “exchange” results.

Well I was actually thinking matrices with thousands of rows, I just gave 100 as an example, and anyway coming from gaming its rare to have a matrix that big obviously.