I have a set S of matrices and I want to apply a sequence of operations on each matrix in S. For example I could apply matrix multiplication and element wise operations such as addition, division, exponentiation, square root etc. For example the code might look like this if I have 1000 matrices:

for i=1:1000{
m = S[i]; //get matrix i
m2 = mm; //matrix multiplication
m2 = m2/constant; //element wise divison
m2 = exp(m2); //element wise exponentiation
m2 = m2
m2; //matrix multiplication

Each matrix has size 300x300.

What is the most efficient way to do this? Currently I am using cublasSgemmBatched for the matrix multiplications. However this function takes float** as input, and as a result I have written kernels for each element wise operation op() that are applied to each matrix. In other words each kernel looks like this :

for each i,j
output[i][j] = op(input[i][j]).

From some benchmarks I did it is obvious that if the element-wise
operations were applied instead on a 1d array of concatenated matrices (ie: matrix i starts at index i300300) execution speed would almost double. I presume this is because a single memory access would be needed per element. However in that case I could not use cublasSgemmBatched.

One alternative option would be to invoke the standard cublasSgemm function once for each matrix, by using a different stream for each matrix (concurrent kernel execution). However would this scale if I used ~1000 streams, one distinct stream per matrix? Alternatively perhaps I could store each matrix pointer in shared memory and continue using cublasSgemmBatched. Would this speed up things, since shared memory access is much more efficient?

What do you think is the best way for me to proceed? Is there something else I should be trying?

Thanks a lot

I don’t think this is a true statement. carve out your matrix allocations all from a single allocation done via a single call to cudaMalloc.

Suppose I have 4 matrices, each of size 2x2. I do a single call to cudaMalloc for 16 elements:

e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 e12 e13 e14 e15

I store matrix 0 in e0-e3, matrix 1 in e4-e7, etc.

For the batched operation, I pass a list of pointers:

&e0, &e4, &e8, &e12

(i.e. I pass a pointer to the start of this list, **)

For the element-wise operations, I just pass a single pointer:


with a length of 16.

If you work at the assembler level, the fastest approach by far would be to modify one of my open source gemm kernels:

Store your matrices as txbob suggests, in one contiguous allocation. Then use the z-block index to track your index i into that allocation. This is the approach Maddy Scientist suggested here:

Here’s the gist of the assembly you’d need to add:

--:-:4:-:1      S2R bz,  SR_CTAID.Z;

// trackA += bz * k * lda * 4
--:-:-:-:1      SHL lda4, lda, 2;
--:-:-:-:1      MOV k, param_k;
08:-:-:-:1      XMAD.LO tz, bz, k, RZ, xmad_tz;
--:-:-:-:1      XMAD.LO trackZA, tz, lda4, RZ, xmad_tz;
--:-:-:-:1      IADD    trackA0.CC, trackZA, trackA0;
--:-:-:-:1      IADD.X  trackA1,    trackA1, RZ;

// C += bz * n * ldc * 4
--:-:-:-:1      MOV ldc, param_ldc;
--:-:-:-:1      MOV n,   param_n;
--:-:-:-:1      SHL n4, n, 2;
--:-:-:-:1      XMAD.LO cz, bz, n4, RZ, xmad_cz;
--:-:-:-:1      XMAD.LO cz, ldc, cz, RZ, xmad_cz;
--:-:-:-:1      IADD    C0.CC, cz, param_C[0];
--:-:-:-:1      IADD.X  C1,    RZ, param_C[1];

Then you can do the elementwise operations just prior to writing the results out. I’d cast the division by a constant to a multiplication by the reciprocal.

You could also modify the code to only load matrix A once.

So that whole loop would just be two kernel calls, and they’d run at full utilization (+6Tflops on TitanX).