gemm[Strided]Batched bug with beta=1.0?

I have two batches of matrices, A and B. All matrices in A have the same dimension, and the same holds for B. I’d like to multiply A[i] * B[i] for all i up to batch size. Instead of storing the result in a batch of output matrices, I’d like to store the sum of A[i] * B[i] for all i up to batch size into a single output matrix R (with appropriate dimensions).
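For reference, here is the reduction I want expressed as a plain host-side triple loop (a minimal sketch; the function name, row-major layout, and dimensions are mine, just for illustration):

```cpp
#include <vector>
#include <cstddef>

// Reference (host) version of the reduction: R = sum over i of A[i] * B[i].
// Each A[i] is m x k, each B[i] is k x n, and R is m x n. Everything is
// row-major here purely for readability; cuBLAS itself is column-major.
void SumOfMatMuls(const std::vector<std::vector<float>>& A,
                  const std::vector<std::vector<float>>& B,
                  std::vector<float>& R, int m, int n, int k) {
  for (size_t b = 0; b < A.size(); ++b)        // for each batch member
    for (int r = 0; r < m; ++r)
      for (int c = 0; c < n; ++c)
        for (int x = 0; x < k; ++x)
          // Accumulate every batch member's product into the single R.
          R[r * n + c] += A[b][r * k + x] * B[b][x * n + c];
}
```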

Looking at the documentation for gemm[Strided]Batched, it seems like the right way to go is:

  1. Set output matrix R to zeros.
  2. Run gemm[Strided]Batched with beta=1.0 and Carray[i] = R in the not-strided case or strideC = 0 in the strided case.
  3. Profit!

This approach doesn’t produce correct values. I haven’t found any documentation stating whether pointers in Carray may alias one another, nor whether strideC may be zero. In both cases the gemm kernel launches without error, so I assumed these parameters are supported. The documentation for gemmStridedBatched describes the computation as:

C + i · strideC = α · op(A + i · strideA) · op(B + i · strideB) + β · (C + i · strideC), for i ∈ [0, batchCount − 1]

It’s unclear whether the for is a parallel for or a sequential for. Further, a blog post I found describes the computation in sequential terms, which suggests the approach I described above ought to work.

Is this a bug in the gemm code or is the scenario I described not supported by cuBLAS?

Can you provide a self-contained reproducer of the issue?

The following code demonstrates the issue.

A is a batch of column vectors, B is a batch of row vectors, and C is the sum of all outer products MatMul(A[i], B[i]) for all i up to batch size. A[0] = ones, B[0] = twos, and A[i>0] = B[i>0] = 0. The expected result is C[i][j] = 2 for all i, j. The actual result is that some entries of C are 0 and some are 2, which strongly suggests the read/modify/write on C is not atomic.

#include <cstdio>
#include <cstring>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

float* DeviceCopy(const void* src, size_t size) {
  float* ptr;
  cudaMalloc((void**)&ptr, size);
  cudaMemcpy(ptr, src, size, cudaMemcpyHostToDevice);
  return ptr;
}

int main() {
  static constexpr int batch_size = 2;
  static constexpr int dim = 16;

  float A[batch_size][dim][1];  // batch of dim x 1 column vectors
  float B[batch_size][1][dim];  // batch of 1 x dim row vectors
  float C[dim][dim];

  memset(A, 0, sizeof(A));
  memset(B, 0, sizeof(B));
  memset(C, 0, sizeof(C));

  for (int i = 0; i < dim; ++i) {
    A[0][i][0] = 1.0f;
    B[0][0][i] = 2.0f;
  }
  float* Ad = DeviceCopy(A, sizeof(A));
  float* Bd = DeviceCopy(B, sizeof(B));
  float* Cd = DeviceCopy(C, sizeof(C));

  cublasHandle_t blas_handle;
  cublasCreate(&blas_handle);

  static const float alpha = 1.0f;
  static const float beta = 1.0f;

  cublasSgemmStridedBatched(
      blas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
      dim, dim, 1,
      &alpha,
      Ad, dim, dim,
      Bd, 1, dim,
      &beta,
      Cd, dim, /*strideC=*/0,
      batch_size);

  cudaMemcpy(C, Cd, sizeof(C), cudaMemcpyDeviceToHost);
  for (int i = 0; i < dim; ++i) {
    for (int j = 0; j < dim; ++j)
      printf("%0.0f ", C[i][j]);
    printf("\n");
  }

  return 0;
}

I don’t think this is expected usage. I think it’s a reasonably safe assumption that cuBLAS requires the C[i] matrices not to overlap, and that there is no use of atomics under the hood that would make overlap work.

The strided batched operation is intended to duplicate the behavior of a bunch of independent gemm operations launched in parallel, i.e. simultaneously. We must assume that each gemm in the batch is allowed to load its C value(s) at the point of the function call. If all members of the batch do that, their results cannot possibly be correct in the general case, even with the use of atomics.
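The hazard can be illustrated with a toy host-side model (this is not cuBLAS code, just my sketch of the “snapshot C at call entry” semantics, reduced to scalars):

```cpp
#include <vector>
#include <cstddef>

// Toy model of a batched gemm where every batch member aliases the same C.
// Each member computes alpha*a[i]*b[i] + beta*C using the value C held at
// the *start* of the call, then writes its result back. Whichever member
// "finishes last" wins, so earlier contributions are lost rather than summed.
float AliasedBatchedScalarGemm(const std::vector<float>& a,
                               const std::vector<float>& b,
                               float c, float alpha, float beta) {
  const float snapshot = c;  // every batch member reads C up front
  float result = c;
  for (size_t i = 0; i < a.size(); ++i)
    result = alpha * a[i] * b[i] + beta * snapshot;  // overwrite, not accumulate
  return result;
}
```

With a = {1, 3}, b = {2, 4}, alpha = beta = 1, and c = 0, the true sum of products is 14, but this model returns 12: only the last member’s contribution survives.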

I’ll see about getting a doc update for this, but I won’t be able to report status/progress toward any doc update. If you’d like your own process to track a doc update, please file a bug requesting the documentation update using the bug-filing instructions linked in the sticky post at the top of the CUDA programming sub-forum.

Thanks for clarifying, Robert.

In addition to the doc update, it would be nice if these functions returned CUBLAS_STATUS_INVALID_VALUE when batchCount > 1 and strideC < ldc * n, i.e. when consecutive C matrices in the batch must overlap. This seems like an inexpensive and straightforward precondition check and would provide immediate feedback to users.
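Until something like that lands, callers can guard against it themselves. A hypothetical standalone check (the function name is mine, not part of cuBLAS) might look like:

```cpp
// Hypothetical precondition check mirroring the one proposed above: reject a
// strideC too small for the batch's C matrices to be disjoint. In column-major
// storage each C matrix spans ldc * n elements, so consecutive matrices in the
// batch overlap whenever strideC is smaller than that.
bool StrideCIsValid(long long strideC, int ldc, int n, int batchCount) {
  if (batchCount <= 1) return true;       // a single C cannot overlap itself
  return strideC >= (long long)ldc * n;   // C[i] and C[i+1] must be disjoint
}
```

A caller would run this on (strideC, ldc, n, batchCount) before invoking gemmStridedBatched and treat a false result the way CUBLAS_STATUS_INVALID_VALUE would be treated.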