I have two batches of matrices, A and B. All matrices in A have the same dimension, and the same holds for B. I’d like to multiply A[i] * B[i] for all i up to batch size. Instead of storing the result in a batch of output matrices, I’d like to store the sum of A[i] * B[i] for all i up to batch size into a single output matrix R (with appropriate dimensions).

Looking at the documentation for gemm[Strided]Batched, it seems like the right way to go is:

- Set output matrix R to zeros.
- Run gemm[Strided]Batched with beta = 1.0, and Carray[i] = R in the non-strided case or strideC = 0 in the strided case.
- Profit!

This approach doesn’t seem to produce correct values. I haven’t found any documentation stating whether the pointers in Carray may alias one another, nor whether strideC may be zero. In either case the gemm kernel launches without error, so I believe these parameters are meant to be supported. The documentation for gemmStridedBatched describes the computation as:

```
C + i*strideC = α · op(A + i*strideA) · op(B + i*strideB) + β · (C + i*strideC),  for i ∈ [0, batchCount − 1]
```

It’s unclear whether the `for` here is a `parallel for` or a `sequential for`. Further, this blog post (https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/) describes the computation in sequential terms, and it seems that the approach I described above ought to work.

Is this a bug in the gemm code, or is the scenario I described simply not supported by cuBLAS?