MAJOR ACCURACY LOSS when EXPORTING tao unet model after retraining pruned model

I need to check on my side as well.

Hi,
I still cannot reproduce the performance drop.
Please refer to my steps below.

$ tao unet run /bin/bash
Note: I am running with the latest TAO 22.05 version. It seems that you are running the 21.11 version.
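
As a quick check on your side, you can confirm which launcher version and per-task docker versions are active (a minimal sketch; this is the same command requested later in this thread):

$ tao info --verbose   # prints the launcher configuration, including which docker each task maps to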

Then, train a model.

unet train -e unet_train_resnet_unet_isbi.txt -r /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_unpruned -m /workspace/demo_3.0/forum_repro/unet_isbi/pretrained_model/resnet_18.hdf5 -n model_isbi -k nvidia_tlt

Run evaluation.

unet evaluate -e unet_train_resnet_unet_isbi.txt -m /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_unpruned/weights/model_isbi.tlt -o /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_evaluate -k nvidia_tlt

Result:

root@9deabc2e2957:/workspace/demo_3.0/forum_repro/unet_isbi# cat isbi_experiment_evaluate/results_tlt.json
"{'foreground': {'precision': 0.70791817, 'Recall': 0.77023894, 'F1 Score': 0.7377648243662859, 'iou': 0.5844907}, 'background': {'precision': 0.94187826, 'Recall': 0.92135996, 'F1 Score': 0.9315061321651128, 'iou': 0.8717936}}"

Export the model and generate trt engine.
unet export -m /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_unpruned/weights/model_isbi.tlt -e unet_train_resnet_unet_isbi.txt --engine_file export/trtfp32.isbi.unpruned.engine -k nvidia_tlt

Run evaluation against the trt engine.
unet evaluate -e unet_train_resnet_unet_isbi.txt -m /workspace/demo_3.0/forum_repro/unet_isbi/export/trtfp32.isbi.unpruned.engine -o /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_evaluate_engine -k nvidia_tlt

Result:

root@9deabc2e2957:/workspace/demo_3.0/forum_repro/unet_isbi# cat /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_evaluate_engine/results_trt.json
"{'foreground': {'precision': 0.70795435, 'Recall': 0.77022797, 'F1 Score': 0.7377794104039289, 'iou': 0.58450913}, 'background': {'precision': 0.94187653, 'Recall': 0.92137486, 'F1 Score': 0.931512872531945, 'iou': 0.8718055}}"

I did not understand what you did there. I thought the point was for you to replicate my scenario in order to reproduce my problem, but instead you went and ran it on something that I don't even have access to:

In any case,

Somehow there is a discrepancy. If I run 

tao unet run /bin/bash

I get 3.21.11:

Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

But running

tao info

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

This seems like a dead end… Very frustrating to waste so much time on this just to go around in an endless loop.

To look into your issue, we need to narrow down the gap between our setups. Even with the default ISBI notebook, there is a gap on your side.
When I use the latest TAO version (22.05) to try to reproduce the performance gap, I cannot reproduce it.

Can you attach the .ipynb and also the full dataset so that I can reproduce it?

BTW, /workspace/demo_3.0/forum_repro/unet_isbi is just a folder on my side. You can ignore it.

I believe you are using an old version of TAO.

Please share the result of the command below.

$ tao info --verbose

Please install the latest:
$ pip3 install --upgrade nvidia-tao

Or for 22.05 docker, you can pull
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
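
If the launcher keeps mapping the unet task to an older docker even after upgrading, a rough sketch for pulling and entering the 22.05 container directly with docker (the --gpus flag and the mounted workspace path are assumptions; adjust them to your setup):

$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
$ docker run --rm -it --gpus all \
    -v /path/to/your/workspace:/workspace \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash
# Inside the container the task entrypoint is called directly, e.g. `unet export ...`,
# without the `tao` prefix, as in the commands earlier in this thread.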

Can you attach the .tlt model you have trained with the ISBI notebook?

NO. Not open source

I will be reinstalling a new workstation soon. It seems to be the only way to isolate the problem, but this whole thing has been very, very time-consuming and I can't freeze my project on this.

Understood. For the .tlt model you have trained with the ISBI notebook, if you have kept it, you can share it with me. It is trained on the ISBI dataset, so there is no copyright issue. Then I can use it to check again in 22.05.
Anyway, I will check whether there is an issue in 21.11.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Indeed, there is a performance drop in the 21.11 version.

In 21.11, the .tlt model result:
"{'foreground': {'precision': 0.69601804, 'Recall': 0.7722573, 'F1 Score': 0.7321583749601505, 'iou': 0.5774841}, 'background': {'precision': 0.94207376, 'Recall': 0.91653854, 'F1 Score': 0.9291307369557293, 'iou': 0.8676416}}"

The TensorRT engine result:
"{'foreground': {'precision': 0.59434956, 'Recall': 0.6486686, 'F1 Score': 0.6203222234776844, 'iou': 0.44961384}, 'background': {'precision': 0.91104954, 'Recall': 0.89044565, 'F1 Score': 0.9006297727298221, 'iou': 0.81922334}}"

You can consider either of the solutions below.

Solution 1:
You do not need to retrain the .tlt model in 22.05. Just use your 21.11 .tlt model and run export under the 22.05 docker. It will generate a new TensorRT engine. Use that engine and run evaluation again; a sketch of this workflow follows Solution 2 below.
I confirm that there is no performance drop now.

"{'foreground': {'precision': 0.6960101, 'Recall': 0.77224094, 'F1 Score': 0.7321466222923124, 'iou': 0.57746947}, 'background': {'precision': 0.94206977, 'Recall': 0.91653717, 'F1 Score': 0.9291280902702932, 'iou': 0.867637}}"

Solution 2:
Use the 22.05 version for training, evaluation, etc.
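
For reference, a rough sketch of Solution 1 inside the 22.05 docker, reusing the export and evaluate commands from earlier in this thread (the paths and the nvidia_tlt key follow the earlier examples; substitute your own):

# Re-export the existing 21.11 .tlt model to a new TensorRT engine
unet export -m /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_unpruned/weights/model_isbi.tlt \
  -e unet_train_resnet_unet_isbi.txt \
  --engine_file export/trtfp32.isbi.unpruned.engine \
  -k nvidia_tlt

# Evaluate against the newly generated engine
unet evaluate -e unet_train_resnet_unet_isbi.txt \
  -m /workspace/demo_3.0/forum_repro/unet_isbi/export/trtfp32.isbi.unpruned.engine \
  -o /workspace/demo_3.0/forum_repro/unet_isbi/isbi_experiment_evaluate_engine \
  -k nvidia_tlt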
