Why is there always a memset kernel before a cuBLAS matrix multiplication kernel?

System: Ubuntu 20.04
GPU: A100 SXM 40GB
CUDA version: 11.3
CUDA driver version: 470.42.01
cuBLAS version: 11.4.2.10064



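Maybe it is because BLAS matrix multiplication (GEMM) is fundamentally a multiply-and-add operation, C = alpha*A*B + beta*C. If the underlying kernel accumulates into C, then computing a plain product requires the output buffer to be zeroed first, which would explain the memset.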
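
To make that guess concrete, here is a minimal CPU-side sketch in C++. The function names `accumulate_matmul` and `plain_matmul` are hypothetical, and this is not a description of how cuBLAS works internally. It only shows why a kernel that accumulates into its output would need that output cleared beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accumulate-only kernel: C += A * B.
// Row-major storage; A is m x k, B is k x n, C is m x n.
// Illustration only, not cuBLAS's actual implementation.
void accumulate_matmul(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C,
                       std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

// Plain product C = A * B built on the accumulate-only kernel.
// The std::fill plays the role a memset would play on the GPU.
void plain_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C,
                  std::size_t m, std::size_t n, std::size_t k) {
    std::fill(C.begin(), C.end(), 0.0f);
    accumulate_matmul(A, B, C, m, n, k);
}
```

If `C` holds stale values and `accumulate_matmul` is called directly, those values end up added to the product. Calling `plain_matmul` discards them, at the cost of one extra pass over the output.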