Has anyone tried to make a CUDA-accelerated Sherman-Morrison algorithm? The algorithm computes the inverse of the sum of an invertible matrix A and the dyadic (outer) product u*v^T of a column vector u and a row vector v^T.
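For reference, the Sherman-Morrison identity expresses that inverse in terms of A^{-1}, so the updated inverse can be obtained without redoing a full inversion:

    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u),

valid whenever the scalar 1 + v^T A^{-1} u is nonzero.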

What would be the best way to implement such an algorithm with a CUDA/MATLAB combination?

Which parts of the algorithm should I compute in MATLAB and which in CUDA?

To compute the left-hand side, you could use one kernel to form the outer product u*v^T, a second kernel to add the matrix A to that outer product, and a final kernel to invert the result via Gauss-Jordan elimination (augmenting it with an identity matrix). Note that this inverts the updated matrix from scratch; if A^{-1} is already available, the Sherman-Morrison formula lets you skip the elimination step entirely and work with a few matrix-vector products instead.
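The first two steps above could be sketched roughly like this (untested; kernel names are illustrative, double precision and column-major storage to match MATLAB's layout are assumed):

```
// Writes (u v^T)_{row,col} = u[row] * v[col] into an N-by-N matrix.
__global__ void outerProduct(const double *u, const double *v,
                             double *uvT, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        uvT[col * N + row] = u[row] * v[col];   // column-major index
}

// Elementwise sum C = A + B over n = N*N entries.
__global__ void matrixAdd(const double *A, const double *B,
                          double *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
```

Since both kernels are memory-bound, you could also fuse them into a single kernel that writes A[col*N+row] + u[row]*v[col] directly, saving one full pass over the matrix.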

Once that’s all implemented, you could compile the kernels and a host interface into a MEX file for use with MATLAB.
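A minimal sketch of such a MEX gateway, assuming the inputs are A (N-by-N), u (N-by-1), and v (N-by-1) as real double arrays and that the file is compiled with `mexcuda` (error checking omitted for brevity; the kernel-launch section is left as a placeholder):

```
#include "mex.h"
#include <cuda_runtime.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int N = (int)mxGetM(prhs[0]);
    size_t matBytes = (size_t)N * N * sizeof(double);
    size_t vecBytes = (size_t)N * sizeof(double);

    // Device buffers for A, u, v, and the result.
    double *dA, *du, *dv, *dOut;
    cudaMalloc(&dA, matBytes);  cudaMalloc(&du, vecBytes);
    cudaMalloc(&dv, vecBytes);  cudaMalloc(&dOut, matBytes);

    // Copy MATLAB inputs to the device.
    cudaMemcpy(dA, mxGetPr(prhs[0]), matBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(du, mxGetPr(prhs[1]), vecBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dv, mxGetPr(prhs[2]), vecBytes, cudaMemcpyHostToDevice);

    // ... launch the kernels here ...

    // Allocate the MATLAB output and copy the result back.
    plhs[0] = mxCreateDoubleMatrix(N, N, mxREAL);
    cudaMemcpy(mxGetPr(plhs[0]), dOut, matBytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(du); cudaFree(dv); cudaFree(dOut);
}
```

From MATLAB you would then call it like any other function, e.g. `B = shermanMorrison(A, u, v);` (the function name is whatever you name the .cu file).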