Missing `ann_file` field while training with Image Classification PyT

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): NVIDIA GeForce RTX 3070
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Image Classification PyT
• TLT Version (Please run “tao info --verbose” and share “docker_tag” here): 5.5.0-pyt

Configuration of the TAO Toolkit Instance

task_group:         
    model:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.5.0-pyt:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. action_recognition
                        2. centerpose
                        3. visual_changenet
                        4. deformable_detr
                        5. dino
                        6. grounding_dino
                        7. mask_grounding_dino
                        8. mask2former
                        9. mal
                        10. ml_recog
                        11. ocdnet
                        12. ocrnet
                        13. optical_inspection
                        14. pointpillars
                        15. pose_classification
                        16. re_identification
                        17. classification_pyt
                        18. segformer
                        19. bevfusion
                5.0.0-tf1.15.5:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.5.0-tf2:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. classification_tf2
                        2. efficientdet_tf2
    dataset:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.5.0-data-services:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.5.0-deploy:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. grounding_dino
                        14. mask_grounding_dino
                        15. mask2former
                        16. lprnet
                        17. mask_rcnn
                        18. ml_recog
                        19. multitask_classification
                        20. ocdnet
                        21. ocrnet
                        22. optical_inspection
                        23. retinanet
                        24. segformer
                        25. ssd
                        26. trtexec
                        27. unet
                        28. yolo_v3
                        29. yolo_v4
                        30. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024

• Training spec file(If have, please share here):
spec.txt (2.1 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

docker run -it --rm --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $PWD:/ws \
    nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
    classification_pyt train \
    -e /ws/spec.yaml \
    results_dir=/ws/results \
    train.gpu_ids=[0] \
    train.num_gpus=1

without-ann-file.log (10.9 KB)
with-ann-file.log (3.2 KB)

Hello, I am learning how to fine-tune a model with TAO Toolkit. I have followed the instructions in the data annotation format docs and the MMPretrain dataset structure docs.

Both documents indicate that an ann_file field is required if the dataset isn’t organized into subfolders whose directory names match the class names. For example, if the dataset structure matches the following:

train/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
├── 123.png
├── nsdf3.png
└── ...

an ann_file matching the following is required:

folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 1
nsdf3.png 2
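
For reference, here is a minimal Python sketch of producing a file in this "relative/path class_index" layout. The helper names (build_ann_lines, write_ann_file) and the example_labels dict are hypothetical, not part of TAO or MMPretrain, and the sketch assumes that image paths contain no whitespace and that class indices are assigned by you beforehand.

```python
from pathlib import Path
from typing import Dict, List


def build_ann_lines(labels: Dict[str, int]) -> List[str]:
    """Return one 'relative/path class_index' line per image, sorted by path."""
    return [f"{rel_path} {class_idx}" for rel_path, class_idx in sorted(labels.items())]


def write_ann_file(labels: Dict[str, int], out_path: Path) -> None:
    """Write the annotation lines to out_path, one image per line."""
    out_path.write_text("\n".join(build_ann_lines(labels)) + "\n", encoding="utf-8")


# Hypothetical mapping copied from the example listing above:
# keys are paths relative to the train/ directory, values are class indices.
example_labels = {
    "folder_1/xxx.png": 0,
    "folder_1/xxy.png": 1,
    "123.png": 1,
    "nsdf3.png": 2,
}

if __name__ == "__main__":
    for line in build_ann_lines(example_labels):
        print(line)
```

Calling write_ann_file(example_labels, Path("train_ann.txt")) would write the same four lines to disk. The output file name is also just an illustration.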

The dataset.data.train.ann_file field isn’t recognized by the TAO schema, but ann_file is recognized for the val and test fields.

Yes, your finding matches the code in tao_pytorch_backend/nvidia_tao_pytorch/core/mmlab/mmclassification/classification_default_config.py (NVIDIA/tao_pytorch_backend on GitHub, commit dc07b02eb78c2eb868315107892b466496e55a0f). There is an ann_file field for validation and inference, but not for training.

The notebook uses the sub-folder format without an ann_file. If possible, could you try arranging your training data in the sub-folder format?

Alternatively, you can log in to the container via docker run and then add ann_file to the training config in tao_pytorch_backend/nvidia_tao_pytorch/core/mmlab/mmclassification/classification_default_config.py (NVIDIA/tao_pytorch_backend on GitHub, commit dc07b02eb78c2eb868315107892b466496e55a0f). You can find this classification_default_config.py under the /usr/xxx folder.

Awesome, thank you. I edited classification_default_config.py at /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/mmlab/mmclassification/classification_default_config.py to add an ann_file field to the TrainData dataclass, matching ValData/TestData. I didn’t use the sub-folder method because I was afraid my category names might contain characters that aren’t valid in directory names.
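
For anyone landing here later, the shape of that change can be sketched roughly as below. This is not the actual TAO source: the class names TrainDataBefore and TrainDataAfter, the data_prefix field, and the Optional[str] type with a None default are assumptions made for illustration, and the real TrainData dataclass has other fields that are omitted here. The sketch only shows why a dataclass-backed schema rejects a key it does not declare and accepts it once the field exists.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TrainDataBefore:
    """Hypothetical stand-in for TrainData as shipped: no ann_file field."""
    data_prefix: Optional[str] = None


@dataclass
class TrainDataAfter:
    """Hypothetical stand-in for TrainData after the local edit."""
    data_prefix: Optional[str] = None
    ann_file: Optional[str] = None  # assumed type/default, mirroring val/test


def accepts(schema_cls, **kwargs) -> bool:
    """Return True if the dataclass can be constructed with the given keys."""
    try:
        schema_cls(**kwargs)
        return True
    except TypeError:
        # Raised by the generated __init__ for keys the dataclass does not declare.
        return False


if __name__ == "__main__":
    print(accepts(TrainDataBefore, ann_file="/ws/train_ann.txt"))  # unknown key
    print(accepts(TrainDataAfter, ann_file="/ws/train_ann.txt"))   # declared key
    print([f.name for f in fields(TrainDataAfter)])
```

Note that the docker run command earlier in this thread uses --rm, so a file edited inside the container is discarded when the container exits. Mounting the patched file over the original with an additional -v option, or building a derived image, are two ways to keep the change.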