FLOPS calculation in cublasDgemm

Dear all,
What is the number of floating-point operations performed by cublasDgemm for a typical matrix multiplication? The operation has the general form
C = alpha * op(A) * op(B) + beta * C

My question is: if we pass zero for beta (beta = 0), does the routine still do the multiplication and addition with zero (0), or does it skip that part (i.e., beta * C) entirely? And does alpha = 1 make any difference to the FLOP count? In the matrixMul example in the NVIDIA SDK, the FLOP count is computed as

FLOPS = 2.0 * (double)uiWA * (double)uiHA * (double)uiWB

How is this calculated? Can anyone explain it a little?

Thanks for your help

Bishwa

A (7,10) matrix times a (10,20) matrix must do 7*20*10 multiplications and 7*20*10 additions, so the FLOP count is 2*7*20*10. Compare that count to the execution time to get the FLOPS rate.
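
To make the counting concrete, here is a small sketch in plain C (not the SDK kernel, just an illustration) that multiplies a 7x10 matrix by a 10x20 matrix and tallies one multiply and one add per inner-loop step:

#include <stdio.h>

#define M 7
#define K 10
#define N 20

int main(void)
{
    static double A[M][K], B[K][N], C[M][N]; /* zero-initialized, data doesn't matter for counting */
    long flops = 0;

    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int p = 0; p < K; p++) {
                sum += A[i][p] * B[p][j]; /* one multiply + one add */
                flops += 2;
            }
            C[i][j] = sum;
        }

    printf("FLOPs = %ld (2*%d*%d*%d = %d)\n", flops, M, N, K, 2 * M * N * K);
    return 0;
}

This prints FLOPs = 2800, which matches 2*7*20*10 and, for the SDK formula, 2.0 * uiWA * uiHA * uiWB with uiWA = 10 (shared dimension), uiHA = 7, uiWB = 20.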

Thanks cricri1, quite a clear explanation. Is it the same for DGEMM?

Sorry, I don't know what DGEMM is.

DGEMM is a very efficient matrix multiplication subroutine in the BLAS (Basic Linear Algebra Subprograms) library (http://netlib.org/blas/). CUDA has its own BLAS library called CUBLAS.

Anyway, I got the response below from an NVIDIA software engineer:

If beta is zero, then the multiplications with beta and addition to C are not done (i.e. the computation of beta*C is skipped completely).

The multiplication by alpha is always done, even if alpha == 1, but those operations are negligible compared to the bulk of the A*B computation. The FLOP count is usually approximated as 2*m*n*k + 3*m*n whenever beta is not 0, and 2*m*n*k for the simpler case of beta = 0.
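
For concreteness, here is a minimal sketch (using the CUBLAS v2 API; error checking and data initialization are omitted, and the matrix sizes are arbitrary) that times cublasDgemm with CUDA events and converts the 2*m*n*k count into GFLOP/s:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int m = 1024, n = 1024, k = 1024;
    const double alpha = 1.0, beta = 0.0; /* beta = 0: the beta*C term is skipped */

    double *A, *B, *C;
    cudaMalloc((void**)&A, (size_t)m * k * sizeof(double));
    cudaMalloc((void**)&B, (size_t)k * n * sizeof(double));
    cudaMalloc((void**)&C, (size_t)m * n * sizeof(double));
    /* fill A and B with real data in practice */

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, m, B, k, &beta, C, m);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* beta = 0, so 2*m*n*k covers everything; add 3*m*n when beta != 0 */
    double flops = 2.0 * m * n * k;
    printf("DGEMM: %.2f ms, %.2f GFLOP/s\n", ms, flops / (ms * 1e6));

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}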

Hope it helps others as well.

Thanks