TAO 5.3 Segformer results poor

IainA · April 16, 2024, 2:42pm

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): RTX A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Segformer
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): 5.3

Hi

I have successfully run Segformer models with custom datasets using TAO 5.2.01. I have exported these models and used them in production with Triton.

I have created a new Python environment and installed the TAO 5.3.0 launcher. I have tried multiple TAO 5.3 Segformer configurations with the same dataset I used for 5.2, and the results were poor or unstable. Specifically:

1. I had to run the 5.3 Segformer container with root privileges.

2. The PyTorch implementation in 5.3 requires all images to be the same size when the validation hook runs. This was not the case with 5.2, and it adds data preparation work.

3. I used exactly the same dataset for 5.2 and 5.3. The 5.3 runs gave very poor results (I would say random or numerically unstable), whereas 5.2 gave excellent results.

4. With FAN models, 5.3 gave me nothing other than NaN. With the mit_b5 backbone I was able to get “results” as described in point 3, but they were poor or meaningless.
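For readers hitting the same-size requirement during validation, a quick way to see how much data preparation is needed is to count the distinct image sizes in the validation folder. The following is a minimal sketch, not part of TAO: it assumes the images are PNG files, uses only the Python standard library, and the function names are hypothetical.

```python
import struct
from collections import Counter
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from a PNG file's IHDR header."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16..23.
    return struct.unpack(">II", header[16:24])


def size_report(image_dir):
    """Count how many PNG files in image_dir have each (width, height)."""
    return Counter(png_size(p) for p in sorted(Path(image_dir).glob("*.png")))


def odd_sized_files(image_dir):
    """List names of PNG files whose size differs from the most common size."""
    report = size_report(image_dir)
    if not report:
        return []
    expected, _ = report.most_common(1)[0]
    return [p.name for p in sorted(Path(image_dir).glob("*.png"))
            if png_size(p) != expected]
```

Calling `size_report` on the validation image folder returns one entry per distinct size. If it has more than one entry, `odd_sized_files` names the images that would need resizing or padding before a 5.3 run.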

As indicated above, my datasets are custom, but I get great results from 5.2 (and I have exported those models to TRT and am using them successfully in Triton in production). The PyTorch implementation in 5.3 is definitely different.

Hope this provides some further insights.

cheers

Morganh · April 16, 2024, 3:00pm

Hi @IainA ,
May I confirm that the poor results come from validation during training?
Could you share the logs as well?
I will try running the default 5.3 notebook first.

IainA · April 16, 2024, 3:07pm

Hi @Morganh

Yes, correct. The log file is attached.
20240414_203217.log (273.2 KB)
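When triaging a long training log for the NaN problem described above, it helps to find the first line where a NaN appears. The following is a minimal sketch under assumptions: the TAO log format is not shown in this thread, so the code only looks for a standalone `nan` token (case-insensitive) on each line, and the function names are hypothetical.

```python
import re

# Match "nan" only as a standalone token, so words such as "nanometer" are ignored.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])nan(?![A-Za-z0-9_])", re.IGNORECASE)


def first_nan_line(lines):
    """Return (line_number, text) for the first line containing NaN, or None."""
    for number, line in enumerate(lines, start=1):
        if NAN_TOKEN.search(line):
            return number, line.rstrip("\n")
    return None


def count_nan_lines(lines):
    """Return how many lines contain a standalone NaN token."""
    return sum(1 for line in lines if NAN_TOKEN.search(line))
```

Both functions accept any iterable of strings, so an open file object works directly, for example `with open("20240414_203217.log") as fh: print(first_nan_line(fh))`.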

Morganh · April 16, 2024, 3:11pm

What were the results with TAO 5.2?
If you did not save the log, you can share the final results for all the classes.

IainA · April 16, 2024, 3:19pm

Hi @Morganh

I don’t have the log, but here’s the result from the evaluate run (it ran on the cloud instance, so the validation hook was not run during training).

Morganh · April 16, 2024, 3:45pm

OK, got it.

Morganh · April 21, 2024, 4:29pm

Hi,
I did not observe a major regression in nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt.
My log is attached. Maybe you can run the ISBI dataset as well to confirm.
20240410_forum_289830_tao_5.3_mit_b5.txt (125.5 KB)

yingliu · May 14, 2024, 6:33am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

system · May 28, 2024, 6:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.