I believe I am seeing a bug in cuBLASLt; it looks like possibly a race condition on the execution stream. I am using CUDA 11, installed recently, and a Titan RTX. The attached testcase demonstrates the issue in more detail.
Here’s the description of the issue:
When both alpha and beta are non-zero, cublasLtMatmul usually gives the documented result:
C = alpha*(A*B) + beta*C
but sometimes gives this instead:
C = alpha*(A*B) + beta*(A*B + C)
The testcase that shows this issue is very simple, so this looks like a bug to me.
This is an important feature because it enables gradient accumulation in the fully connected layers of DNNs, so I have looked into it a bit more. After I build CublasTestcase, which runs the same small test 1000 times in a row, this command:
./CublasTestcase
always fails after 100 or so iterations; each iteration multiplies two 2x2 matrices.
However, if I run this:
cuda-memcheck CublasTestcase
then all 1000 iterations pass, every time it is run.