cublasLt race condition? I've included a test case and explanation

I believe I am seeing a bug in cublasLt, possibly a race condition involving the execution stream. I am using a recently installed CUDA 11 and a Titan RTX. The attached test case explains the issue in more detail.

Here’s the description of the issue:

When both alpha and beta are non-zero, cublasLtMatmul usually gives:

C = alpha*A*B + beta*C

but sometimes gives this instead:

C = alpha*A*B + beta*(A*B + C)
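
To make the two results concrete, here is a minimal CPU-only sketch that computes both expressions for 2x2 matrices. It is not the attached test case and does not call cublasLt; the `Mat2` type, the helper names, and the row-major layout are all made up for illustration.

```cpp
#include <array>
#include <cmath>

// Hypothetical row-major 2x2 matrix type, used only for this illustration.
using Mat2 = std::array<float, 4>;

// Plain 2x2 product A*B on the host.
Mat2 matmul(const Mat2& a, const Mat2& b) {
    return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}

// The result I expect: alpha*A*B + beta*C.
Mat2 expected_result(float alpha, float beta,
                     const Mat2& a, const Mat2& b, const Mat2& c) {
    const Mat2 ab = matmul(a, b);
    Mat2 out{};
    for (int i = 0; i < 4; ++i) out[i] = alpha * ab[i] + beta * c[i];
    return out;
}

// The result I sometimes see: alpha*A*B + beta*(A*B + C).
Mat2 observed_bad_result(float alpha, float beta,
                         const Mat2& a, const Mat2& b, const Mat2& c) {
    const Mat2 ab = matmul(a, b);
    Mat2 out{};
    for (int i = 0; i < 4; ++i) out[i] = alpha * ab[i] + beta * (ab[i] + c[i]);
    return out;
}

// Element-wise comparison with an absolute tolerance.
bool nearly_equal(const Mat2& x, const Mat2& y, float tol = 1e-5f) {
    for (int i = 0; i < 4; ++i) {
        if (std::fabs(x[i] - y[i]) > tol) return false;
    }
    return true;
}
```

The two results differ by exactly beta*A*B, so comparing the GPU output against both host references with `nearly_equal` shows which of the two a given iteration produced.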

The test case that shows this issue is very simple, so this looks like a bug to me.

This is an important feature because it allows gradient accumulation in the fully connected layers of DNNs, so I’ve looked into it a bit more. After building CublasTestcase, which runs the same small test 1000 times in a row, I see that this command:

./CublasTestcase

always fails after 100 or so iterations of the test, which is a matrix multiply of two 2x2 matrices.

However, if I run this:

cuda-memcheck CublasTestcase

then all 1000 test case iterations pass every time it is run.