Description
I downloaded the tao_pytorch_backend repository, trained PointPillars from scratch on a custom version of the KITTI dataset (keeping only the car annotations), and then fine-tuned the model on a smaller custom dataset.
My final model performs well both as the tlt model and as its trt conversion. However, when I run the evaluation multiple times with the same trt engine (fp32) and the same validation set (430 samples), the metrics oscillate. I tried to narrow the problem down, and this is what I have found so far, always using the fp32 trt engine:
- Running the evaluation 100 times on 5 validation samples: in 49 runs one specific sample (let’s call it AnomalousSample0) diverges from the original tlt model result.
- Running the evaluation 100 times on AnomalousSample0 ALONE: it never diverges from the original tlt model result.
- Running the evaluation 100 times on another sample from the initial 5-sample validation set, one that caused no problems even in the batch: no divergence, as expected.
- Running the evaluation 100 times on 10 validation samples (the 5 from the first point plus 5 more): 3 samples diverge from the original tlt model result: AnomalousSample0 (same sample as before) in 16 of 100 runs, AnomalousSample1 in 1 run, and AnomalousSample2 in all 100 runs.
- Running the evaluation 100 times on 5 selected samples that never diverged in the previous tests: no divergence here either.
The problem seems tied to specific samples, but it only appears when they are evaluated together with other samples…
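For anyone who wants to reproduce the per-sample counts above, here is a minimal sketch of how the divergences could be tallied. It assumes the detections of each run have already been loaded into a dict mapping sample id to a NumPy array with one row per detection, in a consistent order; the function name, the array layout, and the tolerance are all hypothetical and not part of the TAO code.

```python
import numpy as np


def count_divergences(reference, runs, atol=1e-4):
    """Count, per sample, how many runs differ from the reference output.

    reference: dict of sample_id -> ndarray of detections (hypothetical layout,
               one row per detection) from the original tlt model.
    runs:      list of dicts with the same layout, one per trt evaluation run.
    """
    counts = {sample_id: 0 for sample_id in reference}
    for run in runs:
        for sample_id, ref in reference.items():
            out = run.get(sample_id)
            # A run matches only if the sample is present, has the same
            # number of detections, and all values agree within atol.
            same = (
                out is not None
                and out.shape == ref.shape
                and np.allclose(out, ref, atol=atol)
            )
            if not same:
                counts[sample_id] += 1
    return counts


# Synthetic data only, to show the intended usage.
reference = {"a": np.zeros((2, 8)), "b": np.ones((1, 8))}
runs = [
    {"a": np.zeros((2, 8)), "b": np.ones((1, 8))},        # identical run
    {"a": np.zeros((2, 8)), "b": np.ones((1, 8)) + 0.5},  # "b" values differ
    {"a": np.zeros((3, 8)), "b": np.ones((1, 8))},        # "a" has an extra box
]
counts = count_divergences(reference, runs)
print(counts)
```

A shape check comes before the value comparison so that a run with a different number of detections is counted as divergent instead of raising a broadcasting error.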
Environment
TensorRT Version: 8.6.1
GPU Type: NVIDIA GeForce RTX 4080
Nvidia Driver Version: nvidia-driver-545
CUDA Version: 12.3 (driver) / 11.8 (runtime)
CUDNN Version: 8.9
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.2.0a0+81ea7a4
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt-base
Relevant Files
At this link you’ll find:
- the ten-sample validation set I used for the experiments described above
ten_val_set
- tlt model
checkpoint_epoch_30.tlt
- engine model
checkpoint_epoch_30.engine
- the configuration file I used
pointpillar_general.yaml
Steps To Reproduce
python nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py --cfg_file pointpillar_general.yaml --save_to_file --output_dir path/to/output/ --key tlt_encode --trt_engine checkpoint_epoch_30.engine
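To repeat the evaluation many times, the command above can be generated once per run with a separate output directory, so that the saved results of each run can be compared afterwards. This is only a sketch: the helper name and the `run_000`-style directory layout are my own assumptions, while the script path and flags are the ones from the command above.

```python
from pathlib import Path

EVAL_SCRIPT = "nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py"


def build_eval_commands(
    n_runs,
    output_root,
    cfg_file="pointpillar_general.yaml",
    engine="checkpoint_epoch_30.engine",
    key="tlt_encode",
):
    """Return one argument list per run, each with its own output directory."""
    commands = []
    for i in range(n_runs):
        # Hypothetical layout: <output_root>/run_000, run_001, ...
        out_dir = Path(output_root) / f"run_{i:03d}"
        commands.append([
            "python", EVAL_SCRIPT,
            "--cfg_file", cfg_file,
            "--save_to_file",
            "--output_dir", str(out_dir),
            "--key", key,
            "--trt_engine", engine,
        ])
    return commands
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the repository root, for example `for cmd in build_eval_commands(100, "path/to/output"): subprocess.run(cmd, check=True)`.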