Difference between predictions of exported TensorRT engine and PyTorch pth models

Are you training with BGR or RGB?
Your offsets in nvinfer are, I think, in BGR order, but your normalisation in pytorch seems to be RGB.

And does your Resize in pytorch maintain aspect ratio? If yes, what kind of padding does it add?
Nvinfer by default does not maintain aspect ratio; if enabled, it will do either bottom-right zero padding or symmetric padding (if symmetric-padding=1).

Lastly did you check your Resize Interpolation method? If I remember correctly, default in pytorch is different from default in nvinfer.
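
For illustration, here is a rough numpy sketch (the values are purely illustrative, just the ImageNet mean/std scaled to 0-255) of how a BGR/RGB offset mismatch changes the tensor the network actually sees:

import numpy as np

# nvinfer-style preprocessing: y = net_scale_factor * (x - offsets)
offsets_rgb = np.array([123.675, 116.28, 103.53])  # ImageNet means * 255, RGB order
offsets_bgr = offsets_rgb[::-1]                    # same values, BGR order
net_scale_factor = 1.0 / 57.63                     # ~ 1 / (mean(std) * 255)

frame_bgr = np.random.randint(0, 256, (224, 224, 3)).astype(np.float32)  # dummy BGR frame

good = net_scale_factor * (frame_bgr - offsets_bgr)  # offsets match the frame's channel order
bad = net_scale_factor * (frame_bgr - offsets_rgb)   # channel-order mismatch
print(np.abs(good - bad).max())                      # non-zero: the network gets a shifted input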

Hi,
We took some steps which helped us achieve a better result, but still only 60% of image predictions match between deepstream and local pytorch inference.

Steps we took:

  1. We changed the nvinfer offsets to RGB order (offsets=123.675;116.28;103.53)

  2. We trained a new model based on images saved from deepstream (in png format) after the object detection phase, with the following pytorch transformations:

    import cv2
    import albumentations as A
    from albumentations.pytorch import ToTensorV2

    def get_transforms():
        train_transforms = A.Compose(
            [
                A.Resize(height=224, width=224, interpolation=cv2.INTER_NEAREST),
                A.HorizontalFlip(p=0.5),
                A.VerticalFlip(p=0.5),
                A.RandomGamma(gamma_limit=(75, 90), p=0.8),
                A.GridDropout(ratio=0.47, p=0.6),
                A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ]
        )

        test_transforms = A.Compose(
            [
                A.Resize(height=224, width=224, interpolation=cv2.INTER_NEAREST),
                A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ]
        )

        return train_transforms, test_transforms

As you can see, we are using the INTER_NEAREST method, because we found out that this method is used by deepstream (About the resize method in nvvideoconvert/nvstreammux - #6 by Fiona.Chen).

We know that different libraries give different resizing results even when using the same interpolation method. Because of that, it is crucial for us to know how the nvinfer interpolation works and how it can be replicated in our offline training, so that preprocessing is the same in the production and offline environments.
We hope that someone has solved a similar issue and can share the information with us.
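
One way to see how much the resize step alone contributes is a rough check like the one below on a crop exported from deepstream (the file path is illustrative): resize it with a few cv2 interpolation flags, compare the pixel-level differences, and then feed each variant through the pth model to see whether those differences actually flip the predicted class.

import cv2
import numpy as np

crop = cv2.imread("exported_crop.png")  # a crop saved from deepstream (path illustrative)

variants = {
    "nearest": cv2.resize(crop, (224, 224), interpolation=cv2.INTER_NEAREST),
    "bilinear": cv2.resize(crop, (224, 224), interpolation=cv2.INTER_LINEAR),
    "area": cv2.resize(crop, (224, 224), interpolation=cv2.INTER_AREA),
}

# Pixel-level differences between the interpolation methods (int16 avoids uint8 wrap-around).
for a in variants:
    for b in variants:
        if a < b:
            diff = np.abs(variants[a].astype(np.int16) - variants[b].astype(np.int16))
            print(f"{a} vs {b}: max diff {diff.max()}, mean diff {diff.mean():.2f}")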

Thanks

I had been struggling with a similar problem, so I ended up using scaling-filter=1 or scaling-filter=2 in the SGIE configuration file. Note that I had not changed the resize method for Streammux, as I did not need to resize there.

To confirm my results, I had exported the crops from deepstream’s detector and run inference on them in pytorch (the accuracy there was much higher than what I was getting in deepstream).
So I played around with the scaling filter, and just by changing to scaling-filter=1 the accuracy went up significantly (~15-20% on different videos). scaling-filter=2 had similar results, but it kept crashing on dGPU, so I stuck to scaling-filter=1.

Thank you for your suggestions.
Changing the offsets to 123.675;116.28;103.53 improved results the most, and scaling-filter=1 helped get better results as well. But there are still mismatches between the models' predictions: now only ~20% of predictions mismatch, compared to ~75% previously.

I have also tried every suggestion @mchi referred to, but none improved the results.

I see.
Lastly, I can think of precision. How about trying to run with FP32 (network-mode=0) and checking?
And maybe also explicitly disable aspect ratio with maintain-aspect-ratio=0, because your pytorch resize does not maintain the aspect ratio.

Right, I forgot to mention that I had already changed to network-mode=0 previously. Also, I'm not using A.Resize(height=224, width=224, interpolation=cv2.INTER_NEAREST) anymore but went back to A.Resize(height=224, width=224), because the results got worse in most cases while changing the settings @mchi referred to.

I have tried what you mentioned by explicitly adding maintain-aspect-ratio=0, but it had no impact on the results.

I see.
Would it be easy for you to train a new model with:

  1. Resize that maintains aspect ratio; and
  2. Bottom-right zero padding

(Something like this: Resize image while maintaing the aspect ratio · Issue #718 · albumentations-team/albumentations · GitHub.) I have not tested this augmentation, so you may need to experiment.

Then try maintain-aspect-ratio=1 in your sgie config file.

According to this, the pre-processing in local inference is: (input - mean) * std

And, according to your deepstream config below, it’s: (input - offsets) * net-scale-factor

So, to make these two match, you need to make:

  1. net-scale-factor = std
  2. offsets * net-scale-factor = mean * std

And, try avoiding scaling in nvstreammux.

[property]
gpu-id=0
offsets=103.53;116.28;123.675
net-scale-factor=0.01735207357279195
labelfile-path=…/classifier/labels.txt
model-engine-file=…/classifier/efficientnet.engine
infer-dims=3;224;224

Could you explain in more detail how I should change my offsets and net-scale-factor? Currently I have them calculated according to this comment, where:
np.array([0.485, 0.456, 0.406])*255 = array([123.675, 116.28 , 103.53 ])
And np.array([0.229, 0.224, 0.225]).mean()*255 = 57.63
Therefore net-scale-factor is going to be 1/57.63 = 0.01735207357279195.
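
For reference, this is just the standard way of mapping A.Normalize onto nvinfer's formula: A.Normalize computes (x - 255*mean) / (255*std), i.e. (1/(255*std)) * (x - 255*mean), which has the same shape as net-scale-factor * (x - offsets); and since net-scale-factor is a single scalar, the per-channel std is approximated by its mean. Spelled out as a short snippet (illustrative only):

import numpy as np

# ImageNet mean/std used in A.Normalize (0-1 range)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# A.Normalize: (x - 255*mean) / (255*std)  ==  (1 / (255*std)) * (x - 255*mean)
# nvinfer:     net-scale-factor * (x - offsets)
offsets = mean * 255                          # -> 123.675;116.28;103.53
net_scale_factor = 1.0 / (std.mean() * 255)   # single scalar, so std is averaged -> 0.01735...

print("offsets=" + ";".join(str(round(v, 3)) for v in offsets))
print("net-scale-factor=" + repr(net_scale_factor))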

You can refer to:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html#gst-nvinfer

I have retrained my model with these transformations:
train_transforms = A.Compose(
    [
        A.LongestMaxSize(max_size=224, interpolation=1),  # 1 == cv2.INTER_LINEAR
        A.PadIfNeeded(min_height=224, min_width=224, border_mode=0, value=(0, 0, 0)),  # 0 == cv2.BORDER_CONSTANT
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomGamma(gamma_limit=(75, 90), p=0.8),
        A.GridDropout(ratio=0.47, p=0.6),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensorV2(),
    ]
)

And in the config I have added maintain-aspect-ratio=1, but unfortunately the results didn't get better.
I have also tried adding symmetric-padding=1 and removing scaling-filter=1, but that didn't help either.

The config I'm currently using:
offsets=123.675;116.28;103.53
net-scale-factor=0.01735207357279195
network-input-order=0
labelfile-path=…/picklist-classifier/labels.txt
model-engine-file=…/picklist-classifier/efficientnet.engine
infer-dims=3;224;224
model-color-format=0
network-mode=0
network-type=1
num-detected-classes=128
interval=0
classifier-threshold=0
scaling-filter=1
maintain-aspect-ratio=1

Can you explain how this works? As I said above, does it do the same pre-processing as the DS config below?

offsets=123.675;116.28;103.53
net-scale-factor=0.01735207357279195

Have you gone through the options I listed in DeepStream SDK FAQ - #21 by mchi and confirmed the settings are configured to be the same as those used in training?

Sorry for the late reply,
This augmentation is slightly different from Nvinfer’s preprocessing:

A.PadIfNeeded(min_height=224, min_width=224, border_mode=0, value=(0, 0, 0)),

Can you please use the following instead to get the same padding as nvinfer:

A.PadIfNeeded(min_height=224, min_width=224, border_mode=0, value=(0, 0, 0), position="top_left")

And its equivalent nvinfer config will be

maintain-aspect-ratio=1
symmetric-padding=0
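
Put together, the test-time transform meant to mirror that config would look roughly like the sketch below (the interpolation flag here is an assumption on my side and needs to match whatever scaling-filter selects in nvinfer):

import cv2
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Sketch: mirror nvinfer with maintain-aspect-ratio=1 and symmetric-padding=0,
# i.e. the image anchored top-left with zero padding on the bottom/right.
test_transforms = A.Compose(
    [
        A.LongestMaxSize(max_size=224, interpolation=cv2.INTER_LINEAR),  # assumption: match scaling-filter
        A.PadIfNeeded(min_height=224, min_width=224,
                      border_mode=cv2.BORDER_CONSTANT, value=(0, 0, 0),
                      position="top_left"),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensorV2(),
    ]
)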

Btw, I just want to confirm whether your PGIE has any object size filtering, i.e. whether the PGIE discards objects below a certain size. If so, maybe that could be an issue as well.

Yes, so I take:
mean=(0.485, 0.456, 0.406) and multiply every value by 255, which gives 123.675;116.28;103.53, e.g. 0.485 * 255 = 123.675.
And for std=(0.229, 0.224, 0.225) I calculate the mean of the std values, which is (0.229 + 0.224 + 0.225) / 3 = 0.226, and then calculate the net-scale-factor as 1 / (0.226 * 255) = 0.017352074.

I have gone through the options you listed in the DeepStream SDK FAQ and confirmed that I'm using the same ones for training as I am for DS.

Thank you for all the help so far.
I have tried what you suggested by adding position="top_left" to the PadIfNeeded call; however, the predictions from the pth model and from deepstream still differ. Maybe you are right that I'm doing something wrong in my PGIE.

I'm using a darknet YOLOv4 model in the PGIE; its config:
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
custom-network-config=…/yolo/checkpoint/yolov4.cfg
model-file=…/yolo/checkpoint/yolov4.weights
model-engine-file=…/yolo/checkpoint/model_b1_gpu0_fp16.engine
labelfile-path=…/yolo/labels.txt
batch-size=10
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
process-mode=1
network-type=0
cluster-mode=4
maintain-aspect-ratio=0
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=…/yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet

[class-attrs-all]
pre-cluster-threshold=0.4

Yolov4 cfg:
[net]
batch=64
subdivisions=16
width=256
height=256
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches=6000
policy=steps
steps=4800, 5400
scales=.1,.1

#cutmix=1
mosaic=1

When you visualize your video, which of these situations occurs?

  1. Bbox there, but wrong classification
    In case this happens, do you have a tracker between your PGIE and SGIE? (NOTE: The tracker may alter the bounding boxes, which may cause some difference in the classification result, so it would be good to move the tracker after the SGIE to verify.)
  2. No bbox
    In case this happens, very likely the detector is the problem. From the nvinfer code it seems the default minimum object width and height is 16. Are your objects smaller than that? Maybe they are being discarded? (A quick way to check is sketched right after this list.) And I see you have disabled clustering, so I think it would be worth setting pre-cluster-threshold=0.0 just to ensure each box is rendered.
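
A rough way to check how close your objects get to that limit (the crops directory and extension are assumptions; point it at whatever exported detector crops you have):

import glob
import cv2

# Count exported detector crops with a side shorter than nvinfer's default 16 px minimum.
small = [p for p in glob.glob("exported_crops/*.png")
         if min(cv2.imread(p).shape[:2]) < 16]
print(f"{len(small)} crops have a side shorter than 16 px")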

Thank you for all the help @marmikshah

When I visualize my video I can see that bbox is there, but classifications are wrong. I do not have a tracker between PGIE and SGIE unfortunately.

Another thing I have tested is writing a benchmark script which takes the engine model file and classifies images locally. This benchmark of the engine model gives me a 98% match when comparing its predictions with the pth model locally. However, this engine file was generated on a different PC than the one I'm running Deepstream 5.1 on, and a different version of TensorRT was used: instead of 7.2.2.3 (which is used in Deepstream 5.1) I used version 8.5.2.2 for the local engine benchmark, because only newer versions are supported by the tensorrt python package.
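
For context, the benchmark is roughly shaped like the sketch below (this assumes the TensorRT 8.x Python API with pycuda, a fixed-shape engine and a single input/output binding; the engine path and the preprocessing are illustrative, not the exact script):

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("efficientnet.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate host/device buffers for every binding (fixed shapes, batch size 1 assumed).
bindings, buffers = [], []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.empty(trt.volume(shape), dtype=dtype)
    device = cuda.mem_alloc(host.nbytes)
    bindings.append(int(device))
    buffers.append((host, device, engine.binding_is_input(i)))

def classify(chw):
    # Run one preprocessed CHW float32 image through the engine and return the argmax class.
    for host, device, is_input in buffers:
        if is_input:
            np.copyto(host, chw.ravel())
            cuda.memcpy_htod(device, host)
    context.execute_v2(bindings)
    for host, device, is_input in buffers:
        if not is_input:
            cuda.memcpy_dtoh(host, device)
            return int(np.argmax(host))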

So my question is: can an older version of TensorRT decrease precision so drastically, or is there still something wrong with my deepstream config files?

@mchi Thank you for all the help so far.
I have a question about TensorRT. Is it possible to run Deepstream 5.1 with TensorRT 8.5.2.2 version?

Hi @Zygislu
No, DS5.1 can’t support TensorRT8.5. Is it possible for you to upgrade to DeepStream 6.2 for TensorRT 8.5.2.2?

Have you confirmed the input to DeepStream 5.1 nvinfer is the same as the input to pth model?

Are you using FP16 inference precision? If you are, could you try FP32? If you see accuracy drop with both FP16 and FP32, I suspect it’s not caused by TensorRT.

Thank you for your answer @mchi
Unfortunately, it is currently not possible to upgrade the system to DeepStream 6.2.

As far as I can tell it is the same, yes, but maybe I'm missing something crucial. I have an RTSP stream which goes to Streammux → PGIE (YOLO model, 1 class) → SGIE (classification model which works only on YOLO detections) → then I save the images that went through Streammux, the OD coordinates, and the classification class in Redis.
From Redis I save the images with the coordinates and the classification class in their names.
Locally I crop the saved images according to the OD coordinates, pass the crops through the pth model, and compare the pth-predicted class with the classification ID in the image's name.
I'm using FP32 inference precision for both the OD and the classification models.
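
For completeness, the offline comparison step is roughly shaped like the sketch below (the directory, the filename pattern and the way the model is loaded are my own illustrative assumptions, not the exact script; test_transforms refers to the test-time Compose shown earlier in the thread):

import glob
import os
import cv2
import torch

# Assumption: the classifier was saved as a fully pickled model (torch.save(model, ...)).
model = torch.load("classifier.pth", map_location="cpu").eval()

def predict(bgr_crop):
    # Apply the same test_transforms as during training and return the argmax class.
    rgb = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2RGB)
    tensor = test_transforms(image=rgb)["image"].unsqueeze(0)
    with torch.no_grad():
        return int(model(tensor).argmax(dim=1))

matches = total = 0
for path in glob.glob("saved_frames/*.png"):
    # Assumed filename pattern: <frame_id>_<x1>_<y1>_<x2>_<y2>_<ds_class_id>.png
    stem = os.path.splitext(os.path.basename(path))[0]
    *_, x1, y1, x2, y2, ds_class = stem.split("_")
    frame = cv2.imread(path)
    crop = frame[int(y1):int(y2), int(x1):int(x2)]
    matches += int(predict(crop) == int(ds_class))
    total += 1

print(f"deepstream vs pth match rate: {matches / total:.1%}")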