Multiplying array of matrices

Let’s say I have A{1}, A{2}, … A{N} and B{1}, B{2}, … B{N}, where each A{i} and B{i} is an M×M matrix (with M around 1000 to 2000).

I need to multiply each A with each B, giving a total of N^2 matrix-matrix multiplications. I can use a nested for loop to do this, but N is large enough that I can’t load all of the matrices into memory at once. So for either A or B (not both), I would have to reload the contents from disk redundantly, for a total of N^2 reads.

If I’m conducting these calculations on a GPU, is there a good way for me to expedite this entire process? Thanks.

Given the matrices aren’t all that large, you could upload multiple B matrices as a block and compute multiple products against a given A matrix in one gemm() call. You could also use asynchronous memory transfers to overlap uploads and downloads with multiply kernel execution. That would basically make the copies free, as long as each transfer takes no longer than the multiply it overlaps with.
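To put a rough number on the blocking idea, here is a small sketch that counts disk reads. It assumes a hypothetical schedule in which `block` B matrices are held in memory at a time and every A is read once per block; the function name `disk_reads_blocked` is made up for illustration and is not part of any library.

```cpp
#include <cassert>

// Count matrix reads from disk for the all-pairs product A{i} * B{j}.
// Assumed schedule (not from the thread itself): keep `block` B matrices
// resident, then stream all n A matrices past that block.
long long disk_reads_blocked(long long n, long long block) {
    assert(n > 0 && block > 0);
    long long blocks = (n + block - 1) / block;  // number of B blocks, rounded up
    // Each B is read once; each A is read once per B block.
    return n + blocks * n;
}
```

With `block = 1` this is the plain nested loop (N + N^2 reads). With N = 100 and `block = 10` it drops from 10100 reads to 1100, and the number of gemm calls drops from N^2 to N times the number of blocks.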

How do you do that? I’ve come across this idea of uploading memory while the kernel is doing something else, but it’s not in the programming guide (that or I missed it). Is there a link to a tutorial on this or something similar? It would be very useful for me, as I’m dealing with continual sensory input from a robot, including camera images, which could be uploaded to the GPU continuously while it’s running various processes on the previous data…


Look at the section on the streams API in the guide, and then the async versions of the memory management calls in the reference manual (cudaMemcpyAsync and friends). Note that async copies need page-locked host memory.

How can I do this with one gemm call? (I’m using double precision, so right now I’m working with cublasDgemm.) Moreover, how much faster is it if I conduct N AxB matrix multiplications in one call vs N individual calls?

A x (B1 U B2 U B3) = AB1 U AB2 U AB3, where U denotes column-wise concatenation (the matrices placed side by side).
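A minimal host-side sketch of that identity, assuming column-major storage as CUBLAS uses. In that layout, B matrices stored back to back in one buffer already form the side-by-side matrix, so no repacking is needed. `gemm_colmajor` is a plain reference loop standing in for cublasDgemm (no transposes, alpha = 1, beta = 0), and `multiply_block` is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference column-major GEMM: C (m x n) = A (m x k) * B (k x n).
// Stand-in for a cublasDgemm call; lda/ldb/ldc are leading dimensions.
void gemm_colmajor(int m, int n, int k,
                   const double* A, int lda,
                   const double* B, int ldb,
                   double* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            C[i + static_cast<std::size_t>(j) * ldc] = sum;
        }
    }
}

// Multiply A (M x M) by `count` M x M matrices stored back to back in Bs.
// In column-major order that buffer is the M x (count*M) matrix [B1 B2 ...],
// so a single GEMM call returns [A*B1 A*B2 ...] in the same layout.
std::vector<double> multiply_block(int M, int count,
                                   const std::vector<double>& A,
                                   const std::vector<double>& Bs) {
    assert(A.size() == static_cast<std::size_t>(M) * M);
    assert(Bs.size() == static_cast<std::size_t>(M) * M * count);
    std::vector<double> Cs(Bs.size());
    gemm_colmajor(M, count * M, M, A.data(), M, Bs.data(), M, Cs.data(), M);
    return Cs;
}
```

On the GPU the same shape of call should apply: pass the number of columns as count*M and keep the leading dimensions of B and C at M. The i-th product then sits at offset i*M*M in the output buffer.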

I don’t know, but DGEMM performs better at larger matrix sizes, and so do the PCI-e copy routines.

EDIT: Actually I do know. CUBLAS DGEMM performance scales more or less linearly with the column count of the B matrix, up to peak flops. At M=2000, N=6000, K=2000 you should be hitting about 80 gflops.
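For a sense of what that rate means in wall-clock time, here is a back-of-the-envelope sketch. It uses the usual 2*M*N*K operation count for GEMM and takes the 80 gflops figure above at face value; the function names are made up for illustration.

```cpp
#include <cassert>

// Floating-point operations in one GEMM: one multiply and one add
// per (i, j, p) triple, hence 2 * m * n * k.
double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Estimated run time in seconds at a sustained rate given in gflops.
double gemm_seconds(double m, double n, double k, double gflops) {
    assert(gflops > 0.0);
    return gemm_flops(m, n, k) / (gflops * 1e9);
}
```

For M=2000, N=6000, K=2000 that is 4.8e10 operations, so roughly 0.6 s per call at 80 gflops. Since N=6000 corresponds to three 2000-column B matrices side by side, that works out to about 0.2 s per individual product.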