I'm trying to multiply large matrices on a CUDA device and want to know whether my assumption is correct.

I assume that if a device has limited RAM, the calling routines need to break the matrices into blocks that can be processed on the device. I'm currently trying to multiply two matrices of around 1.75 GB each and am writing a wrapper to do it in blocks. Are there existing routines that already do this? The idea is to produce a wrapper that automatically breaks a matrix into the required number of blocks, appropriate to the number of devices and the resources available on them.
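To make the blocking idea concrete, here is a minimal CPU-only sketch in plain Python of what I mean. The function name `matmul_blocked` and the `tile` parameter are just my own placeholders, not part of any library. In a real wrapper, the innermost tile product is where the three tiles would be copied to the device and a GEMM routine called; here it is done in plain loops so the blocking logic can be checked on its own.

```python
def matmul_blocked(a, b, tile):
    """Multiply a (n x k) by b (k x m), touching only tile-sized blocks at a time.

    a and b are lists of rows. `tile` is a hypothetical block edge length that
    would be chosen so that three tiles fit in device memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]

    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for p0 in range(0, k, tile):  # walk along the shared dimension
                # On a GPU this is where the A tile, B tile and C tile would be
                # transferred and multiplied on the device. Here: plain loops.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_blocked(a, b, tile=2))
```

The tile size does not need to divide the matrix dimensions evenly; the `min(...)` bounds handle the ragged edge blocks.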

I'm trying to matrix-multiply and then invert a matrix of 30,000 x 50,000 in full precision.
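As a rough way to size the blocks, something like the following could work. This is only a sketch under assumptions: that "full precision" means 8-byte doubles, that the blocks are square, and that three blocks (one each from the two inputs and one for the output) must be resident on the device at once. The names `matrix_bytes` and `max_square_tile` are made up for illustration, and the free-memory figure would have to come from querying the device.

```python
import math


def matrix_bytes(rows, cols, bytes_per_element=8):
    """Storage for a dense rows x cols matrix (8 bytes assumes doubles)."""
    return rows * cols * bytes_per_element


def max_square_tile(free_bytes, bytes_per_element=8, tiles_resident=3):
    """Largest square tile edge such that `tiles_resident` tiles fit in free_bytes."""
    elements_per_tile = free_bytes // (tiles_resident * bytes_per_element)
    return math.isqrt(elements_per_tile)


if __name__ == "__main__":
    # 30,000 x 50,000 doubles is 12,000,000,000 bytes
    print(matrix_bytes(30_000, 50_000))
    # Tile edge for a hypothetical 2 GiB of free device memory
    print(max_square_tile(2 * 1024**3))
```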

Thanks for any guidance