I tested the convolution operation on a P100 and a P4 with TensorFlow 1.8 as follows:
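import tensorflow as tf
# Input in NCHW layout: batch 64, 512 channels, 55x55 spatial size
x = tf.Variable(tf.random_normal((64, 512, 55, 55), dtype=tf.float32))
# 3x3 filter with 512 input channels and 512 output channels
f = tf.Variable(tf.random_normal((3, 3, 512, 512), dtype=tf.float32))
conv_op = tf.nn.conv2d(x, f, [1, 1, 1, 1], 'SAME', data_format='NCHW')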
I generated a timeline.json on both the P100 and the P4. For the convolution op it shows:
| | P4 | P100 |
|---|---|---|
| Occurrences | 5 | 5 |
| Wall duration | 77.903 ms | 122.510 ms |
| Average wall duration | 15.561 ms | 23.502 ms |
Why does the convolution take so much longer on the P100 than on the P4? The advertised peak throughput figures are:
| | Double precision | Single precision | Half precision |
|---|---|---|---|
| P4 | | 5.5 teraFLOPS | 22 teraFLOPS |
| P100 | 4.7 teraFLOPS | 9.3 teraFLOPS | 18.7 teraFLOPS |
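As a rough sanity check on these numbers, the sketch below estimates the floating-point work in this convolution and the throughput implied by the average wall durations. It assumes a direct convolution with each multiply-add counted as 2 FLOPs, and a 55x55 output (stride 1 with 'SAME' padding). cuDNN may pick a different algorithm, so this is only an estimate. The helper names `conv2d_flops` and `implied_tflops` are made up for illustration.

```python
def conv2d_flops(batch, in_ch, out_ch, out_h, out_w, k_h, k_w):
    """Estimated FLOPs for a direct 2-D convolution.

    Assumption: each multiply-add counts as 2 FLOPs, and no
    algorithmic shortcut (FFT, Winograd) is used.
    """
    return 2 * batch * out_ch * out_h * out_w * in_ch * k_h * k_w


def implied_tflops(flops, seconds):
    """Throughput in teraFLOPS implied by doing `flops` in `seconds`."""
    return flops / seconds / 1e12


# Shapes from the test above: x is (64, 512, 55, 55), f is (3, 3, 512, 512).
flops = conv2d_flops(batch=64, in_ch=512, out_ch=512,
                     out_h=55, out_w=55, k_h=3, k_w=3)

# Average wall durations from the timeline, in milliseconds.
avg_wall_ms = {"P4": 15.561, "P100": 23.502}

implied = {gpu: implied_tflops(flops, ms / 1e3)
           for gpu, ms in avg_wall_ms.items()}

print("estimated FLOPs per convolution: %d" % flops)
for gpu in sorted(implied):
    print("%s: implied %.1f teraFLOPS" % (gpu, implied[gpu]))
```

Under these assumptions, the implied rates for both GPUs come out well above their advertised single-precision peaks. That hints that the timeline durations may not cover the full kernel execution, so treat this as a sanity check rather than a diagnosis.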