cuDNN number of processed bytes unclear

According to the “Deep Learning Performance Guide” (https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html#math-mem), the number of bytes accessed is 2*(MK + NK + MN). I don’t understand the “multiply by 2” part. If we are reading two input matrices and writing one, shouldn’t the total number of global memory reads/writes just be MK + NK + MN?

Or is this formula not counting global memory reads/writes?
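
For concreteness, here is a rough sketch of how I am counting the accesses (the dimensions are made-up examples, not from the guide):

```python
# Rough sketch of my element counting for C = A @ B
# A is M x K, B is K x N, C is M x N (illustrative sizes, chosen arbitrarily)
M, N, K = 4096, 4096, 4096

elements_read_A = M * K     # read A once
elements_read_B = K * N     # read B once
elements_written_C = M * N  # write C once

total_elements = elements_read_A + elements_read_B + elements_written_C
print(total_elements)  # MK + NK + MN, with no factor of 2
```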

Hi,

The 2 in the denominator of that ops/bytes (arithmetic intensity) equation comes from the fact that the matrices are FP16, which is 2 bytes per element. The MK + NK + MN term counts the elements accessed; multiplying by the element size (2 bytes for FP16) gives the number of bytes.
https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html#imp-gemm-dim
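
As a quick sanity check, here is a sketch of that arithmetic for an FP16 GEMM (the dimensions below are illustrative, and the sketch assumes each matrix is read or written exactly once from global memory):

```python
# Arithmetic intensity of a GEMM: FLOPs / bytes accessed
M, N, K = 4096, 4096, 4096   # illustrative sizes
bytes_per_element = 2        # FP16 = 2 bytes per element

flops = 2 * M * N * K                                       # one multiply + one add per MAC
bytes_accessed = bytes_per_element * (M * K + K * N + M * N)

arithmetic_intensity = flops / bytes_accessed
print(f"{arithmetic_intensity:.1f} FLOPs/byte")             # ~1365 FLOPs/byte for this case
```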

Thanks