System: Ubuntu 20.04
GPU: A100 SXM 40GB
CUDA version: 11.3
CUDA driver version: 470.42.01
cuBLAS version: 11.4.2.10064

Maybe BLAS primarily implements multiply-and-add kernels (GEMM computes C = alpha*A*B + beta*C), so that to get a plain multiply the result buffer must be zeroed out first.
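
To make the multiply-and-add idea concrete, here is a minimal reference sketch of GEMM semantics in plain C. It is not cuBLAS code and says nothing about which kernels cuBLAS actually launches; `ref_gemm` is a hypothetical helper, and the row-major layout is an assumption made for readability. It follows the documented BLAS convention that when beta is zero, C is not read on input, so at the API level a plain multiply only needs beta = 0 rather than an explicitly zeroed buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference GEMM: C = alpha * A * B + beta * C.
 * Row-major storage (an assumption for this sketch):
 * A is m x k, B is k x n, C is m x n. */
void ref_gemm(size_t m, size_t n, size_t k,
              float alpha, const float *A, const float *B,
              float beta, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Accumulate one dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];

            if (beta == 0.0f) {
                /* Plain multiply: the old contents of C are never read,
                 * so stale values (even NaN) are simply overwritten. */
                C[i * n + j] = alpha * acc;
            } else {
                /* Multiply-and-add: the old C contributes to the result. */
                C[i * n + j] = alpha * acc + beta * C[i * n + j];
            }
        }
    }
}
```

Calling `ref_gemm(..., 1.0f, A, B, 0.0f, C)` gives C = A*B regardless of what C held before, while beta = 1.0f accumulates into the existing C. Whether a given cuBLAS version zeroes a buffer internally for some algorithms is a separate question that this sketch does not answer.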