cuDNN number of processed bytes unclear

According to the “Deep Learning Performance Guide”, the number of bytes accessed is 2*(MK + NK + MN). I don’t understand the “multiply by 2” part. If we are reading two input matrices and writing one, shouldn’t the total number of global memory reads/writes just be MK + NK + MN?

Or does this figure not refer to global memory reads/writes?


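The 2 in the denominator of that ops/bytes equation comes from the element size: the matrices are FP16, which is 2 bytes per element. MK + NK + MN counts the elements accessed (two inputs read, one output written), so multiplying by 2 converts that count into bytes.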
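To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The function names and the `bytes_per_element` parameter are my own, for illustration only; they are not part of cuDNN or any NVIDIA API. It assumes each matrix is read or written exactly once and that a GEMM costs 2*M*N*K operations (one multiply and one add per term).

```python
def gemm_bytes(M, N, K, bytes_per_element=2):
    """Bytes moved for an M x K times K x N matrix multiply.

    Assumes A (M*K) and B (K*N) are each read once and C (M*N) is
    written once. bytes_per_element=2 corresponds to FP16.
    """
    elements = M * K + N * K + M * N
    return bytes_per_element * elements


def gemm_ops(M, N, K):
    """Operation count: one multiply and one add per inner-product term."""
    return 2 * M * N * K


def arithmetic_intensity(M, N, K, bytes_per_element=2):
    """Ops per byte, i.e. the ops/bytes ratio discussed above."""
    return gemm_ops(M, N, K) / gemm_bytes(M, N, K, bytes_per_element)


if __name__ == "__main__":
    M = N = K = 1024
    print("elements accessed:", M * K + N * K + M * N)
    print("bytes (FP16):", gemm_bytes(M, N, K))
    print("bytes (FP32):", gemm_bytes(M, N, K, bytes_per_element=4))
    print("ops/byte (FP16):", arithmetic_intensity(M, N, K))
```

With FP32 inputs the same formula would have a 4 in place of the 2, which halves the ops/byte ratio for the same matrix sizes.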