Let’s say I have A{1}, A{2}, …, A{N} and B{1}, B{2}, …, B{N}, where every A{i} and B{i} is an M×M matrix (with M around 1000 to 2000).

I need to multiply each A with each B, giving a total of N^2 matrix-matrix multiplications. I can do this with a nested for loop, but the problem is that N is large enough that I can’t load all of the matrices into memory at once. So for either A or B (not both), I would have to reload the contents from disk redundantly, for a total of N^2 reads.
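
To make the read pattern concrete, here is a rough sketch of the nested loop I have in mind. It is only an illustration: `load_matrix` is a hypothetical stand-in for reading one matrix from disk (it just builds a tiny 2×2 matrix and counts the call), and `matmul` is a plain Python product standing in for whatever the GPU library would do.

```python
# Sketch of the naive nested loop, with a counter for simulated disk reads.
# load_matrix is a hypothetical placeholder, not a real I/O routine.

reads = {"A": 0, "B": 0}

def load_matrix(name, i, m=2):
    """Pretend to read matrix name{i} from disk; returns an m x m list of lists."""
    reads[name] += 1
    scale = i + 1 if name == "A" else -(i + 1)
    return [[scale * (r * m + c + 1) for c in range(m)] for r in range(m)]

def matmul(x, y):
    """Plain triple-loop product; on a GPU this would be a library call."""
    n, k, p = len(x), len(y), len(y[0])
    return [[sum(x[r][t] * y[t][c] for t in range(k)) for c in range(p)]
            for r in range(n)]

def all_pairs(n):
    """Compute A{i} * B{j} for every pair (i, j) with the naive loop order."""
    results = {}
    for i in range(n):
        a = load_matrix("A", i)        # each A{i} is read once
        for j in range(n):
            b = load_matrix("B", j)    # each B{j} is re-read for every i
            results[(i, j)] = matmul(a, b)
    return results

N = 3
products = all_pairs(N)
print(len(products), reads)
```

With this loop order each A is read once (N reads), while every B is read again on each outer iteration (N^2 reads). That second count is the redundancy I would like to avoid.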

If I’m running these calculations on a GPU, is there a good way to speed up the whole process? Thanks.