Very low evaluation results for the DINO model from dino.ipynb in tao-getting-started_v5.3

I tried to run dino.ipynb in tao-getting-started_v5.3, and after 12 epochs of training the evaluation results are very low. What could be the reason for this?

Evaluation results:
Evaluation metrics generated.
Testing DataLoader 0: 100%|███████████████████| 625/625 [12:26<00:00, 1.19s/it]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_class_error │ 69.84188842773438 │
│ test_loss │ 25.781869888305664 │
│ test_loss_bbox │ 0.16190668940544128 │
│ test_loss_ce │ 0.7826069593429565 │
│ test_loss_giou │ 0.6508875489234924 │
│ test_mAP │ 0.0003703692060464326 │
│ test_mAP50 │ 0.0007101921440162097 │
└───────────────────────────┴───────────────────────────┘
Evaluation finished successfully


Do you mean you ran the default notebook with the default dataset mentioned in the notebook? Did you change anything on your side? Also, did you use the pretrained model?

Yes, I ran the default notebook with the default dataset. I changed nothing except setting “LOCAL_PROJECT_DIR”.

And I used the pretrained model: fan_small_hybrid_nvimagenet.pth

How about the training log? Could you upload it with the attachment button?

Here is a brief log. Is it enough for analysis?
status.json.txt (8.9 KB)

I see in the log that there was an OOM error in the first run. What did you change to make it work for the second run?

Also, from your log, the loss is decreasing, but very slowly.
How many GPUs did you use, and which GPU model?

I decreased the batch size from 4 to 2, and I have only one RTX 3090 GPU with 24 GB of memory.

What is a typical rate for the loss to decrease?

For a standard dataset like COCO, please try to follow the guide mentioned in DINO - NVIDIA Docs.

Optimize GPU Memory
There are various ways to optimize GPU memory usage. One obvious trick is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. Hence, we recommend setting the configurations below to reduce GPU memory consumption (see the example spec snippet after this list).

Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

Set train.activation_checkpoint to True to enable activation checkpointing. By recomputing the activations instead of caching them in memory, memory usage can be reduced.

Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shares the gradient calculation across different processes to help reduce GPU memory.

Try using a more lightweight backbone such as fan_tiny, or freeze the backbone by setting model.train_backbone to False.

Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
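
To make the above concrete, here is a minimal sketch of how those options could be expressed in a DINO experiment spec. The YAML nesting is an assumption derived from the dotted parameter names quoted above (train.precision, dataset.batch_size, etc.), and the values are only examples; check them against the spec file shipped with the notebook.

```yaml
# Hypothetical excerpt of a DINO experiment spec -- layout assumed from the
# dotted parameter names above; values are examples only.
train:
  num_epochs: 12
  precision: fp16                     # automatic mixed precision (~50% less GPU memory)
  activation_checkpoint: True         # recompute activations instead of caching them
  distributed_strategy: ddp_sharded   # shard gradient work across processes
dataset:
  batch_size: 2                       # reduced from 4 to fit a single 24 GB RTX 3090
model:
  train_backbone: False               # optionally freeze the backbone to save memory
```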

Thank you for the information; I will try it.

Do I need to increase num_epochs for better training results?

You can follow the items mentioned above first.
After that is done, you can run more experiments, such as changing num_feature_levels as mentioned in the notebook, increasing the number of training epochs, etc.
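
For illustration only, a follow-up experiment along those lines might touch the fields below. The placement and values are assumptions (num_feature_levels and num_epochs are only referenced by name in the notebook and in this thread), so verify them against your own spec file.

```yaml
# Hypothetical follow-up tweaks -- field placement and values are assumptions.
model:
  num_feature_levels: 4   # example value; the notebook discusses changing this setting
train:
  num_epochs: 36          # example: train longer than the demo's 12 epochs
```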

I have one more question about the DINO model.

In “getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/dino”, the classmap.txt lists 80 classes, but the spec file defines num_classes as 91.

Is that a problem, and could it be a reason for the low performance?

Please refer to DINO - NVIDIA Docs.
The category_id from your COCO JSON file should start from 1 because 0 is set as a background class. In addition, dataset.num_classes should be set to max class_id + 1. For instance, even though there are only 80 classes used in COCO, the largest class_id is 90, so dataset.num_classes should be set to 91.
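
As a small illustration of that rule (dataset.num_classes is the field quoted from the docs above; the class IDs are the standard COCO ones, and the YAML nesting is assumed from the dotted name):

```yaml
# COCO annotations use 80 categories, but category_id runs from 1 to 90
# (0 is reserved for the background class), so:
dataset:
  num_classes: 91   # max class_id (90) + 1
```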

Also, as mentioned in tao_tutorials/notebooks/tao_launcher_starter_kit/dino/dino.ipynb at main · NVIDIA/tao_tutorials · GitHub, “For this demonstration, we changed the architectures from the original implementation so that the training can be completed faster”, so please set num_queries back to 900 (a sketch follows below).
Please share the training log once you have it.
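
A sketch of that change, assuming num_queries sits under the model section of the spec like the other model parameters mentioned in this thread:

```yaml
model:
  num_queries: 900   # restore the value the demo notebook reduced for faster training
```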

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.