Pointpillar engine has weird oscillating performances

vittoria.cavicchioli1 · May 28, 2024, 12:16pm

Description

I downloaded the tao_pytorch_backend repository, trained PointPillars from scratch on a custom version of the KITTI dataset (I kept only the car annotations), and then fine-tuned it on a smaller custom dataset.
My final model performs well both as a tlt model and as its converted TensorRT engine. However, when I run the evaluation multiple times with the same TensorRT engine (fp32) and the same validation set (430 samples), the metrics oscillate from run to run. I tried to narrow the problem down, and this is what I have found so far, always using the fp32 TensorRT engine:

Running the evaluation 100 times on 5 validation samples: in 49 runs, one specific sample (let’s call it AnomalousSample0) diverges from the original tlt model result.
Running the evaluation 100 times on AnomalousSample0 alone: it never diverges from the original tlt model result.
Running the evaluation 100 times on another sample from the initial 5-sample validation set, one that caused no problems in the batch: no divergence, as expected.
Running the evaluation 100 times on 10 validation samples (the same 5 as in the first point plus 5 more): 3 samples diverge from the original tlt model result: AnomalousSample0 (the same sample as before) in 16 of 100 runs, AnomalousSample1 in 1 run, and AnomalousSample2 in all 100 runs.
Running the evaluation 100 times on 5 selected samples that never diverged in the previous tests: no divergence here either.
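
For anyone who wants to repeat this kind of count, here is a minimal sketch of how the per-sample divergence tally could be computed. It is not the script I actually used: the function names, the tolerance value, and the toy data are all hypothetical, and it assumes the detections of each run have already been parsed into plain lists of numbers, in the same order as the reference (tlt) detections.

```python
from collections import Counter
from math import isclose


def detections_match(ref, out, abs_tol=1e-4):
    """Return True if two detection lists agree value-by-value within abs_tol."""
    if len(ref) != len(out):
        return False
    return all(
        len(r) == len(o) and all(isclose(a, b, abs_tol=abs_tol) for a, b in zip(r, o))
        for r, o in zip(ref, out)
    )


def count_divergences(reference, runs, abs_tol=1e-4):
    """Count, per sample ID, how many runs disagree with the reference result."""
    counts = Counter()
    for run in runs:
        for sample_id, ref_dets in reference.items():
            # A sample missing from a run is treated as having no detections.
            if not detections_match(ref_dets, run.get(sample_id, []), abs_tol):
                counts[sample_id] += 1
    return dict(counts)


# Toy data (hypothetical): each detection is a flat list of numbers,
# for example box parameters followed by a score.
reference = {"000001": [[1.0, 2.0, 0.9]], "000002": [[4.0, 5.0, 0.8]]}
runs = [
    {"000001": [[1.0, 2.0, 0.9]], "000002": [[4.0, 5.0, 0.8]]},  # matches the reference
    {"000001": [[1.0, 2.5, 0.7]], "000002": [[4.0, 5.0, 0.8]]},  # sample 000001 diverges
]
divergences = count_divergences(reference, runs)
print(divergences)
```

Samples that never diverge simply do not appear in the returned dictionary, so the output directly lists the anomalous samples and how often each one differed.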

The problem seems tied to specific samples, but it only shows up when they are evaluated together with other samples…

Environment

TensorRT Version: 8.6.1
GPU Type: NVIDIA GeForce RTX 4080
Nvidia Driver Version: nvidia-driver-545
CUDA Version: CUDA Driver Version / Runtime Version 12.3 / 11.8
CUDNN Version: CUDNN_MAJOR 8, CUDNN_MINOR 9
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.2.0a0+81ea7a4
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt-base

Relevant Files

At this link you’ll find:

the ten-sample validation set I used for the experiments above: ten_val_set
the tlt model: checkpoint_epoch_30.tlt
the engine model: checkpoint_epoch_30.engine
the configuration file I used: pointpillar_general.yaml

Steps To Reproduce

python nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py --cfg_file pointpillar_general.yaml --save_to_file --output_dir path/to/output/ --key tlt_encode --trt_engine checkpoint_epoch_30.engine
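
To see the oscillation, the command above has to be run many times. A minimal sketch of a wrapper that does this is below. It only reuses the flags from the command above; the helper names, the per-run output directory layout (`run_000`, `run_001`, ...), and the `runner` parameter are assumptions made for illustration, not part of the TAO scripts.

```python
import subprocess
from pathlib import Path


def build_eval_command(output_dir):
    """Build the evaluation command from the post, pointing at a given output directory."""
    return [
        "python",
        "nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py",
        "--cfg_file", "pointpillar_general.yaml",
        "--save_to_file",
        "--output_dir", str(output_dir),
        "--key", "tlt_encode",
        "--trt_engine", "checkpoint_epoch_30.engine",
    ]


def run_repeated(num_runs, output_root, runner=subprocess.run):
    """Run the evaluation num_runs times, each into its own subdirectory (assumed layout)."""
    output_dirs = []
    for i in range(num_runs):
        out_dir = Path(output_root) / f"run_{i:03d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        # check=True makes subprocess.run raise if the evaluation script fails.
        runner(build_eval_command(out_dir), check=True)
        output_dirs.append(out_dir)
    return output_dirs
```

Calling `run_repeated(100, "path/to/output")` from the repository root would then leave one result directory per run, which can be compared against the tlt results afterwards.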

AakankshaS · May 29, 2024, 5:47pm

Hi @vittoria.cavicchioli1 ,
Thank you for raising this concern. I would direct you to the correct platform for it, where you can raise it.
