It is not clear whether you want to actually invert matrices or need a solver for small dense systems. What is needed for most use cases is the latter, also for numerical reasons.
Assuming the matrices don’t get any larger than 9x9, one potential strategy is to write a self-contained dedicated solver (with partial pivoting if necessary) for each particular matrix size, completely unroll all the loops and convert the matrix elements into scalars. In that scenario, each thread handles one matrix, and it works best if the matrix elements are float, because high register usage might become a performance limiter otherwise.
This used to work reasonably well for me quite a number of years ago, prior to the introduction of batched interfaces in CUBLAS. The advantage is that you are not restricted by the API. You would definitely want to start with an experiment, using the smallest relevant matrix size to see how that holds up performance-wise.
Thank you for your suggestion, njuffa! Yes, I wrote my own solver optimized for my matrix pattern and it was a few times faster. I did not write a generic getrf/getrs solver, though, since I’m not sure how efficient it would be.
I am glad to hear it worked out well. Yes, that is what I meant: Build a specialized solver specifically for you use case, rather than using a generic interface suggested by standard BLAS. Generic interfaces provide the widest applicability, but often this comes at the cost of performance.