I want to solve many weighted least-squares problems. The solution to each problem is calculated as

r = inv(B' * W1 * W2 * B) * B' * W1 * W2 * S

where inv means the matrix inverse. How well is this suited to CUDA?

As one example, I need to solve 250 000 problems. B is 225 x 6, W1 and W2 are 225 x 225 (diagonal matrices, so all off-diagonal elements are zero), S is 225 x 1, and the result r is 6 x 1. The inverse is thus taken of a 6 x 6 matrix. B and W1 are the same for each problem; W2 and S vary.
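To make the per-problem work concrete, here is a minimal CPU sketch in C++ of what a single problem involves. It is not a CUDA kernel and not a tuned implementation. The names solve_wls and N are made up for illustration. Because W1 and W2 are diagonal, the sketch keeps only their diagonals and multiplies them element by element. It also swaps the explicit inv(...) for a Cholesky solve of the 6 x 6 normal equations, which assumes the combined weights are positive and B has full column rank.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 6;  // number of unknowns (columns of B); hypothetical name

// Solves one problem via the normal equations
//   (B' * W * B) r = B' * W * S,  with W = W1 * W2 stored as diagonals.
// B is row-major with m rows and N columns; w1, w2 and S have m entries.
// Returns false if the normal matrix is not positive definite.
bool solve_wls(const std::vector<double>& B, const std::vector<double>& w1,
               const std::vector<double>& w2, const std::vector<double>& S,
               std::array<double, N>& r) {
    const std::size_t m = S.size();
    double A[N][N] = {};  // B' * W * B (only the lower triangle is filled)
    double b[N] = {};     // B' * W * S

    for (std::size_t k = 0; k < m; ++k) {
        const double w = w1[k] * w2[k];  // product of the two diagonal entries
        const double* row = &B[k * N];
        for (std::size_t i = 0; i < N; ++i) {
            b[i] += row[i] * w * S[k];
            for (std::size_t j = 0; j <= i; ++j) A[i][j] += row[i] * w * row[j];
        }
    }

    // In-place Cholesky factorization A = L * L' (L overwrites the lower triangle).
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t k = 0; k < j; ++k) A[j][j] -= A[j][k] * A[j][k];
        if (A[j][j] <= 0.0) return false;
        A[j][j] = std::sqrt(A[j][j]);
        for (std::size_t i = j + 1; i < N; ++i) {
            for (std::size_t k = 0; k < j; ++k) A[i][j] -= A[i][k] * A[j][k];
            A[i][j] /= A[j][j];
        }
    }

    // Forward substitution: L y = b (y overwrites b).
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t k = 0; k < i; ++k) b[i] -= A[i][k] * b[k];
        b[i] /= A[i][i];
    }
    // Back substitution: L' r = y (solved entries are written back into b).
    for (std::size_t i = N; i-- > 0;) {
        for (std::size_t k = i + 1; k < N; ++k) b[i] -= A[k][i] * b[k];
        r[i] = b[i] / A[i][i];
        b[i] = r[i];
    }
    return true;
}
```

Each problem is independent of the others, so one possible mapping (an assumption, not something I have measured) would be one GPU thread or one small thread group per problem, with B and the diagonal of W1 shared by all of them.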

If I instead use these numbers, is it too hard for CUDA? Solve 8 million problems, with B 3375 x 20, W1 and W2 3375 x 3375, S 3375 x 1, and r 20 x 1. The inverse is then taken of a 20 x 20 matrix.
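For a sense of scale, the data that changes per problem (the diagonal of W2 and the vector S) can be sized with a little arithmetic. The sketch below assumes 4-byte single-precision values and that only the diagonal of W2 is stored. The function name per_problem_input_bytes is made up for illustration.

```cpp
#include <cstdint>

// Bytes needed for the per-problem inputs of all problems:
// one diagonal of W2 plus one vector S, each with `rows` entries.
// Assumes only the diagonal of W2 is stored (hypothetical helper).
constexpr std::uint64_t per_problem_input_bytes(std::uint64_t problems,
                                                std::uint64_t rows,
                                                std::uint64_t bytes_per_value) {
    return problems * rows * 2 * bytes_per_value;
}

// 250 000 problems, 225 rows, float: 450 000 000 bytes (about 450 MB).
static_assert(per_problem_input_bytes(250000, 225, 4) == 450000000ULL,
              "small case");
// 8 million problems, 3375 rows, float: 216 000 000 000 bytes (about 216 GB).
static_assert(per_problem_input_bytes(8000000, 3375, 4) == 216000000000ULL,
              "large case");
```

Under these assumptions the larger case would likely have to be streamed to the GPU in batches, since all the inputs together are hundreds of gigabytes.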