The result of Depthwise Separable Convolution layer fusion is incorrect

**Depthwise Separable Convolution**
A depthwise convolution with activation followed by a convolution with activation may sometimes be fused into a single optimized DepSepConvolution layer. The precision of both convolutions must be INT8 and the device's compute capability must be 7.2 or later.
(T4: 7.5, P4000: 6.1)
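
To make the quoted condition concrete, here is a minimal sketch of it as a check. The helper name `may_fuse_depsep` is hypothetical and not part of TensorRT; it encodes only the two stated requirements (INT8 precision, compute capability 7.2 or later). Since the documentation says "may sometimes", a `True` result means the fusion is allowed, not that it will happen.

```python
# Hypothetical helper, not a TensorRT API: it encodes only the two conditions
# quoted above (both convolutions INT8, compute capability >= 7.2).
MIN_CAPABILITY = (7, 2)


def may_fuse_depsep(compute_capability, precisions):
    """Return True if the quoted conditions allow DepSepConvolution fusion.

    compute_capability: (major, minor) tuple, e.g. (7, 5) for a Tesla T4.
    precisions: precision names of the two convolutions, e.g. ("INT8", "INT8").
    """
    all_int8 = all(p.upper() == "INT8" for p in precisions)
    # Tuples compare element by element, so (7, 5) >= (7, 2) but (6, 1) < (7, 2).
    return all_int8 and tuple(compute_capability) >= MIN_CAPABILITY


print(may_fuse_depsep((7, 5), ("INT8", "INT8")))  # Tesla T4 -> True
print(may_fuse_depsep((6, 1), ("INT8", "INT8")))  # Quadro P4000 -> False
```

This matches what I see below: the fusion is possible on the T4 and not on the P4000.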

When I use an NVIDIA Quadro P4000 (compute capability 6.1, so no fusion), conv1 + relu1 is not fused with conv/dw1 + relu/dw1.

When I use a Tesla T4, the four layers are fused into one layer, but the output is very different from that of the P4000 and the accuracy is completely wrong. When I mark conv1 as an output, which prevents conv1 from fusing with conv/dw, I get the same result as on the P4000.
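
For reference, the "mark conv1 as output" workaround looks roughly like the sketch below with the TensorRT Python API. It assumes `network` is an already populated `INetworkDefinition` and that the layer is literally named `conv1`; the helper name `mark_layer_as_output` is made up for illustration.

```python
def mark_layer_as_output(network, layer_name):
    """Mark the first output tensor of the named layer as a network output.

    `network` is assumed to behave like a TensorRT INetworkDefinition:
    it exposes num_layers, get_layer(i) and mark_output(tensor).
    """
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name == layer_name:
            tensor = layer.get_output(0)
            # A tensor marked as a network output has to be materialized,
            # which is why the layer is no longer fused with what follows it.
            network.mark_output(tensor)
            return tensor
    raise ValueError(f"layer {layer_name!r} not found in network")
```

Calling `mark_layer_as_output(network, "conv1")` before building the engine is what I mean by marking conv1 as output.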

But marking conv1 as an output affects speed. Is there another way to prevent the layer fusion? And why does the fused layer
conv1 + relu1 + conv/dw1 + relu/dw1 generate a very different result compared with
layer1: conv1 + relu1, layer2: conv/dw1 + relu/dw1?
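
To put a number on "very different", the two outputs can be compared along the lines of the sketch below. It assumes both engines' outputs have been copied to the host and flattened into plain lists of floats; `compare_outputs` is a hypothetical helper and the values in the usage lines are placeholders, not my real results.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max absolute difference, cosine similarity) of two flat float lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    max_abs_diff = max(
        (abs(r - c) for r, c in zip(reference, candidate)), default=0.0
    )
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    # Cosine similarity is undefined if either vector is all zeros.
    cosine = dot / norm if norm > 0 else float("nan")
    return max_abs_diff, cosine


# Placeholder values only: unfused (P4000-like) vs fused (T4-like) output.
unfused = [0.10, 0.70, 0.20]
fused = [0.60, 0.10, 0.30]
print(compare_outputs(unfused, fused))
```

A cosine similarity close to 1 with a small maximum difference would point to ordinary INT8 rounding noise; a much lower similarity would suggest the fused layer computes something different.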