TAO 5.3 Segformer results poor

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 5.3
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi

I have successfully run SegFormer models with custom datasets using TAO 5.2.01. I have exported these models and used them in production with Triton.

I have created a new python environment and installed the TAO 5.3.0 launcher. I have tried multiple TAO 5.3 segformer configurations with the same dataset as used for 5.2 and received poor or unstable results. Specifically:

  1. I had to run the 5.3 SegFormer container with root privileges.
  2. The PyTorch implementation in 5.3 requires all images to be the same size during the validation hook runs. This was not the case with 5.2, and it adds data preparation work.
  3. I used exactly the same dataset for 5.2 and 5.3. The 5.3 runs gave very poor results (I would say random or numerically unstable), whereas 5.2 gave excellent results.
  4. With 5.3 I could not get any results other than NaN for the FAN models. With the mit_b5 backbone I did get the “results” described in point 3, but they were poor or meaningless.
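For anyone hitting point 2, here is a minimal sketch of a pre-flight check that lists which validation images differ in size from the majority. It is not part of TAO. It assumes the images are PNG files in a single folder, and the function names (`read_png_size`, `group_by_size`, `find_outliers`) are hypothetical helpers made up for this illustration.

```python
import struct
from collections import defaultdict
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(path):
    """Return (width, height) from a PNG's IHDR chunk without decoding pixels."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: b"IHDR", 16-23: width/height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def group_by_size(image_dir):
    """Map each (width, height) to the sorted list of PNG file names with that size."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).glob("*.png")):
        groups[read_png_size(path)].append(path.name)
    return dict(groups)


def find_outliers(groups):
    """Return (most common size, file names that do not have that size)."""
    if not groups:
        return None, []
    majority = max(groups, key=lambda size: len(groups[size]))
    outliers = sorted(
        name
        for size, names in groups.items()
        if size != majority
        for name in names
    )
    return majority, outliers


if __name__ == "__main__":
    import sys

    # Usage (assumed layout): python check_sizes.py /path/to/val/images
    if len(sys.argv) > 1:
        size, odd = find_outliers(group_by_size(sys.argv[1]))
        print(f"most common size: {size}; files with a different size: {odd}")
```

Files reported as outliers would then need resizing or padding (images and masks together) before the 5.3 validation hook can run on them. How best to do that depends on the dataset, so it is left out of the sketch.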

As indicated above, my datasets are custom, but I get great results from 5.2 (and I have exported those models to TRT and am using them successfully in production with Triton). The PyTorch implementation in 5.3 is definitely different.

Hope this provides some further insights.

cheers

Hi @IainA ,
Can you confirm that the poor results come from the validation runs during training?
Could you also share the logs?
I will try running the default 5.3 notebook first.

Hi @Morganh

Yes, correct. Here is the log file attached.
20240414_203217.log (273.2 KB)

What results did you get with TAO 5.2?
If you did not save the log, you can share the final results for all classes.

Hi @Morganh

I don’t have the log, but here are the results from an evaluate run (it was on a cloud instance, so the validation hook was not run during training).

(image: evaluation results)

OK, got it.

Hi,
I did not observe a major regression with nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt.
My log is attached. You could also run the ISBI dataset to confirm.
20240410_forum_289830_tao_5.3_mit_b5.txt (125.5 KB)

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.