Hello

I have a set S of matrices and I want to apply a sequence of operations on each matrix in S. For example I could apply matrix multiplication and element wise operations such as addition, division, exponentiation, square root etc. For example the code might look like this if I have 1000 matrices:

for i=1:1000{

m = S[i]; //get matrix i

m2 = m*m; //matrix multiplication
m2 = m2/constant; //element wise divison
m2 = exp(m2); //element wise exponentiation
m2 = m2*m2; //matrix multiplication

}

Each matrix has size 300x300.

What is the most efficient way to do this? Currently I am using cublasSgemmBatched for the matrix multiplications. However this function takes float** as input, and as a result I have written kernels for each element wise operation op() that are applied to each matrix. In other words each kernel looks like this :

for each i,j

output[i][j] = op(input[i][j]).

From some benchmarks I did it is obvious that if the element-wise

operations were applied instead on a 1d array of concatenated matrices (ie: matrix i starts at index i*300*300) execution speed would almost double. I presume this is because a single memory access would be needed per element. However in that case I could not use cublasSgemmBatched.

One alternative option would be to invoke the standard cublasSgemm function once for each matrix, by using a different stream for each matrix (concurrent kernel execution). However would this scale if I used ~1000 streams, one distinct stream per matrix? Alternatively perhaps I could store each matrix pointer in shared memory and continue using cublasSgemmBatched. Would this speed up things, since shared memory access is much more efficient?

What do you think is the best way for me to proceed? Is there something else I should be trying?

Thanks a lot