32x32 matrix invert NOT using cuBLAS?

Hey experts,

I need to implement a 32x32 matrix inverse for a uni project, but I can't use any libraries. I feel like I could do this with a single CUDA block, but using 16384 bytes of shared memory seems like it would cut my occupancy (and thus performance) by 4x. I need this to run BLAZING fast. Any thoughts on implementation?

If you are a registered CUDA developer, you can download the “Batched Solver” code from the CUDA registered developer website, which includes code for inverting small matrices. The file to download is BatchedSolver_v1_1.tgz.

The included source code is under a BSD license, which should be compatible with just about any kind of project. Note that this code is several years old and somewhat outdated at this point, but you should be able to adapt it for your project, and it should get you started faster than writing everything from scratch.

If you are not yet a registered CUDA developer, registration is easy and approval usually happens within one business day. Start at https://developer.nvidia.com/