I don’t know how large your matrices are, but arithmetically, matrix multiplication can be decomposed into sub-problems. The beginning part of the answer here:
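Independently of that link, here is a minimal sketch of the decomposition idea: a product C = A·B can be computed one tile of C at a time, so only small sub-blocks of A and B are touched at any moment. The tile size and the row-major, square layout below are arbitrary choices made purely for illustration.

```c
#include <stddef.h>

/* Minimal sketch: C = A*B computed tile by tile.
 * Matrices are row-major and square (n x n) for simplicity; TILE is arbitrary.
 * Each (ib, jb) tile of C is accumulated from TILE-sized panels of A and B,
 * so the arithmetic only ever involves small sub-blocks at a time.
 */
enum { TILE = 64 };

static void blocked_matmul(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0.0f;

    for (size_t ib = 0; ib < n; ib += TILE)
        for (size_t jb = 0; jb < n; jb += TILE)
            for (size_t kb = 0; kb < n; kb += TILE)
                /* multiply the (ib,kb) block of A by the (kb,jb) block of B
                   and accumulate into the (ib,jb) block of C */
                for (size_t i = ib; i < ib + TILE && i < n; i++)
                    for (size_t k = kb; k < kb + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jb; j < jb + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The same blocking idea carries over to GEMM calls on the GPU, where each tile product becomes one library call on device sub-arrays.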
What kind of GPU do you use? 21 MB is not particularly large as matrices go, and with even budget GPUs offering at least 1 GB of on-board memory, you should be able to keep several such matrices resident on the GPU. You may want to look more closely at the memory management performed by your application.
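As part of that review, it can help to confirm how much device memory is actually free at various points in the application. A small sketch using the CUDA runtime API (the output format is just an illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query free/total device memory to see how much headroom
// the application really has at a given point in its execution.
int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("free: %.1f MB of %.1f MB total\n",
                free_bytes / (1024.0 * 1024.0),
                total_bytes / (1024.0 * 1024.0));
    return 0;
}
```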
Agreed. Your out-of-memory problem should be investigated rather than sidestepped by switching to a different library.
If you have an out-of-memory issue that is actually due to this matrix size (perhaps because you have ~50 such matrices in memory), then the correct solution would be to manage that situation somehow. cufftXt won’t solve any problem like that for you.
There are a number of different Tesla GPUs with different amounts of memory, but it is reasonably safe to assume that your Tesla has at least 4 GB of on-board memory and thus enough memory for many instances of a 21 MB matrix. Since you have not shown any code that would allow others to reproduce your issue, I can only give the general recommendations to:
review the number of memory allocations, and the size of each
properly check the return status of each CUDA API call, each CUBLAS API call, and each kernel launch (a sketch of such checks follows below)
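As a rough sketch of that kind of status checking (a common pattern, not code from this thread; the names in the usage comment are placeholders):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Check the return status of a CUDA runtime API call.
#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Check the return status of a CUBLAS API call.
#define CHECK_CUBLAS(call)                                                  \
    do {                                                                    \
        cublasStatus_t st_ = (call);                                        \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                                 \
            std::fprintf(stderr, "CUBLAS error %d at %s:%d\n",              \
                         (int)st_, __FILE__, __LINE__);                     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Typical usage (names and sizes are placeholders):
//   float *d_A = nullptr;
//   CHECK_CUDA(cudaMalloc(&d_A, bytes));
//   cublasHandle_t handle;
//   CHECK_CUBLAS(cublasCreate(&handle));
//   my_kernel<<<grid, block>>>(d_A);
//   CHECK_CUDA(cudaGetLastError());        // catches kernel launch errors
//   CHECK_CUDA(cudaDeviceSynchronize());   // catches kernel execution errors
```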
Yes, but a 21 MB matrix is not a very large matrix. There would be no need or reason to split it up. Rather than pursuing this path, I would suggest getting a very crisp understanding of the out-of-memory issue. A 21 MB matrix cannot by itself cause an out-of-memory issue on any GPU.