Convolution speed issue

We are comparing the speed of convolution layers in PyTorch. By our calculation, the B layers have fewer parameters and FLOPs than the A layers, yet the A layers run faster than the B layers. Why is that? Is the 3x3 conv specially optimized on CUDA?

input channels : 128
output channels : 128
image size : 64 x 32

A layers : 3x3 conv + 3x3 conv
B layers : 3x1 conv + 1x3 conv + 3x1 conv + 1x3 conv
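
For reference, the parameter and FLOP comparison above can be reproduced with a short calculation. This is only a sketch under assumptions the post does not state: stride 1, "same" padding (so every layer outputs 64 x 32), no bias terms, and 128 channels between all intermediate layers. The helper names conv_cost and stack_cost are made up for illustration.

```python
def conv_cost(c_in, c_out, kh, kw, out_h, out_w):
    """Return (weights, MACs) for one conv layer.

    Assumes no bias, stride 1 and 'same' padding, so the output
    feature map has the same spatial size as the input.
    """
    params = c_in * c_out * kh * kw
    macs = params * out_h * out_w  # one multiply-accumulate per weight per output pixel
    return params, macs


def stack_cost(kernels, channels=128, height=64, width=32):
    """Sum weights and MACs over a stack of convs that all keep `channels` channels."""
    total_params = 0
    total_macs = 0
    for kh, kw in kernels:
        params, macs = conv_cost(channels, channels, kh, kw, height, width)
        total_params += params
        total_macs += macs
    return total_params, total_macs


# Kernel shapes (height, width) taken from the post.
A_KERNELS = [(3, 3), (3, 3)]
B_KERNELS = [(3, 1), (1, 3), (3, 1), (1, 3)]

a_params, a_macs = stack_cost(A_KERNELS)
b_params, b_macs = stack_cost(B_KERNELS)

print(f"A: {a_params:,} weights, {a_macs:,} MACs")
print(f"B: {b_params:,} weights, {b_macs:,} MACs")
print(f"B/A MAC ratio: {b_macs / a_macs:.3f}")
```

Under these assumptions B needs about two thirds of A's arithmetic. The count covers multiply-accumulates only; it does not capture that B runs four layers instead of two, so it launches twice as many convolutions and writes twice as many intermediate feature maps.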