Why is convolution on a P100 much slower than on a P4?

I benchmarked a convolution on a P100 and a P4 with TensorFlow 1.8 as follows:

import tensorflow as tf

# x is NCHW (batch, channels, height, width); f is HWIO (height, width, in, out)
x = tf.Variable(tf.random_normal((64, 512, 55, 55), dtype=tf.float32))
f = tf.Variable(tf.random_normal((3, 3, 512, 512), dtype=tf.float32))
conv_op = tf.nn.conv2d(x, f, [1, 1, 1, 1], 'SAME', data_format='NCHW')

The timeline.json generated on both the P100 and the P4 shows:

                         P4           P100
Occurrences              5            5
Wall duration            77.903 ms    122.510 ms
Average wall duration    15.561 ms    23.502 ms

How can the convolution take so much longer on the P100 than on the P4? The advertised peak throughput figures are:

        Double precision    Single precision    Half precision
P4      -                   5.5 teraFLOPS       22 teraFLOPS
P100    4.7 teraFLOPS       9.3 teraFLOPS       18.7 teraFLOPS
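
For reference, here is a rough estimate of how much arithmetic this convolution involves and what throughput the measured averages would imply. It is only a sketch: it assumes a direct convolution (one multiply and one add per filter tap, stride 1, SAME padding), whereas cuDNN may pick a different algorithm such as Winograd or FFT with a different real operation count. The helper names `conv2d_flops` and `implied_tflops` are my own, not part of TensorFlow.

```python
# Back-of-the-envelope FLOP count for the conv2d above.
# Assumption: direct convolution, stride 1, SAME padding (output H x W == input H x W).

def conv2d_flops(batch, in_ch, height, width, k_h, k_w, out_ch):
    """Return the FLOPs of a direct convolution, counting 2 per multiply-accumulate."""
    macs = batch * out_ch * height * width * k_h * k_w * in_ch
    return 2 * macs

def implied_tflops(flops, seconds):
    """Throughput in teraFLOPS implied by doing `flops` operations in `seconds`."""
    return flops / seconds / 1e12

# Shapes from the snippet: x is (64, 512, 55, 55), f is (3, 3, 512, 512).
flops = conv2d_flops(64, 512, 55, 55, 3, 3, 512)
print(f"FLOPs per convolution: {flops:.3e}")

# Average wall durations (ms) taken from the timeline table.
for name, avg_ms in [("P4", 15.561), ("P100", 23.502)]:
    print(f"{name}: {implied_tflops(flops, avg_ms / 1e3):.1f} teraFLOPS implied")
```

Comparing the implied figures with the advertised peaks is one way to sanity-check whether the timeline durations really cover the whole kernel execution.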