Linear algebra library for small to medium size matrices?

Hi,

I was wondering whether any of you know of CUDA-compatible libraries for small to medium-sized matrices (say, up to 128x128) that can be used in kernel code.

I have only heard of the Eigen library, but I suspect that it processes each matrix in a single thread to keep the API compatible with host-side code. This would lead to problematic memory access patterns when many threads each operate on a different matrix.

Has anyone developed libraries that make use of CUDA's Cooperative Groups feature, and that might be able to keep a matrix of known size in registers and perform some operations fully in parallel?
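To make the idea concrete, here is a rough sketch of what I have in mind, not taken from any existing library. It assumes N is a power of two no larger than 32, column-major storage (so the loads are coalesced), and a block size that is a multiple of N. The names tile_matvec, batched_matvec and run_batched_matvec16 are made up for illustration. Thread i of a tile keeps row i of the matrix in registers and the vector elements are exchanged with tile.shfl.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical helper: y = A * x for an N x N matrix held in registers.
// Thread i of the tile owns row i of A and element i of x.
template <int N>
__device__ float tile_matvec(cg::thread_block_tile<N> tile,
                             const float (&row)[N], float x_i)
{
    float y_i = 0.0f;
#pragma unroll
    for (int j = 0; j < N; ++j) {
        // Fetch x_j from the thread with rank j in the tile.
        y_i += row[j] * tile.shfl(x_i, j);
    }
    return y_i;
}

// One tile of N threads per matrix. A is a batch of column-major N x N
// matrices; x and y are batches of length-N vectors.
template <int N>
__global__ void batched_matvec(const float* A, const float* x, float* y, int batch)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<N> tile = cg::tiled_partition<N>(block);

    int m = (blockIdx.x * blockDim.x + threadIdx.x) / N;  // matrix index
    if (m >= batch) return;  // uniform per tile when blockDim.x % N == 0

    int i = tile.thread_rank();  // row owned by this thread
    float row[N];
#pragma unroll
    for (int j = 0; j < N; ++j) {
        row[j] = A[m * N * N + j * N + i];  // column-major: coalesced over i
    }
    y[m * N + i] = tile_matvec<N>(tile, row, x[m * N + i]);
}

// Hypothetical host wrapper for N = 16 (error checking omitted for brevity).
std::vector<float> run_batched_matvec16(const std::vector<float>& A,
                                        const std::vector<float>& x, int batch)
{
    constexpr int N = 16;
    float *dA, *dx, *dy;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dy, x.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;  // must be a multiple of N
    int blocks = (batch * N + threads - 1) / threads;
    batched_matvec<N><<<blocks, threads>>>(dA, dx, dy, batch);

    std::vector<float> y(x.size());
    cudaMemcpy(y.data(), dy, y.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

This only covers a matrix-vector product for N up to 32. Anything up to 128x128 would need several rows per thread or a larger group, which is exactly the part I would like a library to handle.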

Previously I’ve done a matrix implementation that used warp-synchronous programming, supporting matrix sizes up to 32xN. However, implicit warp-synchronous code has been discouraged since Volta introduced independent thread scheduling, which Ampere retains.
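For reference, the pattern I mean looks roughly like the sketch below, rewritten with the explicit *_sync shuffle intrinsics so that it does not rely on implicit lockstep execution. This is a simplified illustration rather than my original code. It assumes a single warp of 32 threads, column-major storage, and one matrix row per lane. The names warp_column_sum, column_sums and run_column_sums4 are made up.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each lane contributes one element of a column (its own row's entry).
// Returns the sum over all 32 lanes, available in every lane.
__device__ float warp_column_sum(float a_ij)
{
    const unsigned full_mask = 0xffffffffu;  // all 32 lanes participate
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        a_ij += __shfl_xor_sync(full_mask, a_ij, offset);
    }
    return a_ij;
}

// Column sums of a 32 x N column-major matrix. Launch with <<<1, 32>>>.
template <int N>
__global__ void column_sums(const float* A, float* sums)
{
    int lane = threadIdx.x;  // lane i owns row i
    float row[N];
#pragma unroll
    for (int j = 0; j < N; ++j) {
        row[j] = A[j * 32 + lane];  // coalesced across lanes
    }
#pragma unroll
    for (int j = 0; j < N; ++j) {
        float s = warp_column_sum(row[j]);
        if (lane == 0) sums[j] = s;
    }
}

// Hypothetical host wrapper for N = 4 (error checking omitted for brevity).
std::vector<float> run_column_sums4(const std::vector<float>& A)
{
    constexpr int N = 4;
    float *dA, *dsums;
    cudaMalloc(&dA, 32 * N * sizeof(float));
    cudaMalloc(&dsums, N * sizeof(float));
    cudaMemcpy(dA, A.data(), 32 * N * sizeof(float), cudaMemcpyHostToDevice);

    column_sums<N><<<1, 32>>>(dA, dsums);

    std::vector<float> sums(N);
    cudaMemcpy(sums.data(), dsums, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dsums);
    return sums;
}
```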