Dear all,
What is the number of floating-point operations performed by cublasDgemm for a typical matrix multiplication? The call computes the general form
C = alpha * op(A) * op(B) + beta * C
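To make the semantics concrete, here is a plain-C reference sketch of that update (my own illustration, not cuBLAS's actual implementation), assuming row-major m x k, k x n, and m x n matrices with op(A) = A and op(B) = B. It also shows the common BLAS convention the question is about: when beta == 0, the old contents of C are ignored and never read.

```c
#include <stddef.h>

/* Reference sketch of C = alpha*A*B + beta*C (row-major, no transposes).
 * This is NOT cuBLAS's implementation -- it only illustrates the math,
 * including the convention that beta == 0 skips reading/scaling old C. */
void dgemm_ref(int m, int n, int k, double alpha,
               const double *A, const double *B,
               double beta, double *C)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)          /* inner product: k multiply-adds */
                acc += A[i * k + p] * B[p * n + j];
            if (beta == 0.0)
                C[i * n + j] = alpha * acc;       /* beta*C skipped entirely */
            else
                C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

For example, with alpha = 1 and beta = 0 the routine overwrites C with the plain product A*B, regardless of what C held before the call.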
My question is: if we pass zero for beta (beta = 0), does the routine still perform the multiplication and addition with zero, or does it skip that part (i.e., beta * C) entirely? And does alpha = 1 make any difference in the FLOP count? In the matrixMul example in the NVIDIA SDK, the FLOP count is computed as 2 * m * n * k.
DGEMM is a very efficient matrix-multiplication subroutine in the BLAS (Basic Linear Algebra Subprograms) library (http://netlib.org/blas/). CUDA has its own BLAS library called CUBLAS.
Anyway, I got the response below from one of NVIDIA's software engineers:
If beta is zero, then the multiplications with beta and addition to C are not done (i.e. the computation of beta*C is skipped completely).
The multiplication by alpha is always performed, even if alpha == 1, but those operations are negligible compared to the bulk of the A*B computation. The FLOP count is usually approximated as 2*m*n*k + 3*m*n when beta is nonzero, and 2*m*n*k for the simpler case of beta = 0.
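Those two approximations can be wrapped in a small helper for benchmarking (a hypothetical function of my own, useful when converting a measured DGEMM runtime into GFLOP/s):

```c
/* Approximate FLOP count of C = alpha*A*B + beta*C, following the rule
 * quoted above: 2*m*n*k multiply-adds for A*B, plus 3*m*n extra ops
 * (alpha scale, beta scale, add) only when beta is nonzero. */
double dgemm_flops(int m, int n, int k, double beta)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;
    if (beta != 0.0)
        flops += 3.0 * (double)m * (double)n;
    return flops;
}
```

Dividing this by the measured kernel time in seconds (and by 1e9) gives the usual GFLOP/s figure reported for GEMM benchmarks.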