I tried to run the dino.ipynb in tao-getting-started_v5.3, and after 12 epochs of training, the evaluation results are very low. What is the potential reason for this?
Do you mean you ran the default notebook with the default dataset mentioned in it? Did you make any changes on your side? Also, did you use the pretrained model?
For a standard dataset like COCO, please try to follow the guide mentioned in DINO - NVIDIA Docs.
Optimize GPU Memory
There are various ways to optimize GPU memory usage. One obvious trick is to reduce dataset.batch_size, but this can make your training take longer than usual. Hence, we recommend setting the configurations below to optimize GPU consumption.
Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.
Set train.activation_checkpoint to True to enable activation checkpointing. By recomputing the activations instead of caching them in memory, memory usage is reduced.
Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shards gradient computation across different processes to help reduce GPU memory.
Try using a more lightweight backbone like fan_tiny, or freeze the backbone by setting model.train_backbone to False.
Try changing the augmentation resolution in dataset.augmentation depending on your dataset (a combined spec sketch follows this list).
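Putting those together, a minimal sketch of the relevant spec sections might look like the following. The field names are the ones quoted above from the DINO docs; the backbone key, the batch size, and the augmentation note are illustrative assumptions, not settings taken from this thread:

```yaml
train:
  precision: fp16                    # automatic mixed precision (~50% memory saving)
  activation_checkpoint: True        # recompute activations instead of caching them
  distributed_strategy: ddp_sharded  # Sharded DDP across processes
model:
  backbone: fan_tiny                 # lighter backbone (assumed key name)
  train_backbone: False              # alternatively, freeze the backbone entirely
dataset:
  batch_size: 2                      # reduce only if still running out of memory
  # augmentation: lower the resize/crop resolution here to fit your data
```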
First, you can try several of the items mentioned above.
Once that is done, you can run more experiments, such as changing num_feature_levels as mentioned in the notebook, increasing the number of training epochs, and so on (see the sketch below).
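As a rough sketch of where those knobs live in the spec file (assuming the TAO 5.3 DINO layout, where num_feature_levels sits under model and the epoch count under train; values are illustrative):

```yaml
model:
  num_feature_levels: 4   # e.g. compare 2 vs. 4 feature levels
train:
  num_epochs: 36          # train longer than the notebook's 12 epochs
```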
In “getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/dino”, the classmap.txt lists 80 classes, but the spec file defines num_classes as 91.
Is this a problem? And could it be a reason for the low performance?
Please refer to DINO - NVIDIA Docs.
The category_id from your COCO JSON file should start from 1 because 0 is set as a background class. In addition, dataset.num_classes should be set to max class_id + 1. For instance, even though there are only 80 classes used in COCO, the largest class_id is 90, so dataset.num_classes should be set to 91.
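If you want to double-check your own annotations, a minimal sketch like the one below (standard library only; the annotation path is hypothetical) computes the value dataset.num_classes should take:

```python
import json

# Hypothetical path to your COCO-format annotation file
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

category_ids = [cat["id"] for cat in coco["categories"]]

# category_id should start from 1; 0 is reserved for the background class
assert min(category_ids) >= 1, "category_id 0 collides with the background class"

# dataset.num_classes should be max class_id + 1
num_classes = max(category_ids) + 1
print(f"Set dataset.num_classes to {num_classes}")
```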
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks