Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
NVIDIA RTX A5000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
DINO object detection with a FAN backbone (backbone: fan_small, fan_hybrid_small pretrained weights)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Running "tlt info --verbose" gives "Command 'tlt' not found". I am working directly inside the container nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt (pulled below), so the docker tag is 5.0.0-pyt.
• Training spec file(If have, please share here)
Training spec (train.yaml):
```yaml
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/train/images/
      json_file: /workspace/tao-experiments/object_detection/data/train/train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/val/images/
      json_file: /workspace/tao-experiments/object_detection/data/val/val.json
  num_classes: 2
  batch_size: 2
  workers: 8
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vfan_hybrid_small/fan_hybrid_small.pth
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10]
    momentum: 0.9
  num_epochs: 2
  precision: fp16
  activation_checkpoint: True
```
Export spec (export.yaml):
```yaml
export:
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model.onnx
  on_cpu: True
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1
```
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
- Pull the docker image:
```shell
docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
```
- Download the pretrained backbone:
```shell
ngc registry model download-version nvidia/tao/pretrained_dino_imagenet:fan_hybrid_small --dest ./models/tiny_train/pretrained_resnet18/
```
- Enter the docker container:
```shell
sudo docker run -it --runtime=nvidia -e DISPLAY=$DISPLAY \
  -v ./data/tiny_train:/workspace/tao-experiments/object_detection/data \
  -v ./notebooks_repo/specs_in_use:/workspace/tao-experiments/object_detection/specs \
  -v ./models/tiny_train:/workspace/tao-experiments/object_detection/models \
  -v /tmp/.X11-unix/:/tmp/.X11-unix \
  -v /dev:/dev \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash
```
- Train the model:
```shell
dino train -e /workspace/tao-experiments/object_detection/specs/_train_.yaml -r /workspace/tao-experiments/object_detection/models/unpruned_resnet18/
```
Training reports success, creates checkpoints in ./train/lightning_logs/version_0/checkpoints, and writes model files to ./train. However, it logs the following warning when loading the pretrained backbone, which I suspect is the source of the problem:
```
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
```
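For clarity on how PyTorch's `load_state_dict` classifies these: keys that exist in the model but not in the checkpoint are reported as "missing", and keys in the checkpoint but not in the model are "unexpected". A minimal illustration using a few of the key names from the warning above (stand-in sets, not the real state dicts):

```python
# Stand-in key sets: a few names taken from the _IncompatibleKeys warning.
# In reality these would be model.state_dict().keys() and the checkpoint's keys.
model_keys = {"out_norm1.weight", "out_norm1.bias", "learnable_downsample.weight"}
ckpt_keys = {"norm.weight", "norm.bias", "head.fc.weight", "out_norm1.weight"}

# load_state_dict computes exactly these two set differences:
missing = sorted(model_keys - ckpt_keys)      # in the model, absent from checkpoint
unexpected = sorted(ckpt_keys - model_keys)   # in the checkpoint, absent from model
print("missing:", missing)
print("unexpected:", unexpected)
```

The `norm.*`/`head.fc.*` unexpected keys look like classifier-head leftovers from the ImageNet pretraining, while the `out_norm*`/`learnable_downsample.*` missing keys belong to the detection backbone, which would explain a benign warning at load time.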
- Export the model:
```shell
dino export -e /workspace/tao-experiments/object_detection/specs/_export_.yaml \
  export.checkpoint=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth \
  export.onnx_file=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch021.onnx \
  results_dir=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/
```
Export fails, complaining about many missing keys in the state dict:
```
Error(s) in loading state_dict for DINOPlModel:
Missing key(s) in state_dict: "model.model.backbone.0.body.conv1.weight",
```
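To see what the saved checkpoint actually contains, I would enumerate its keys and compare the backbone prefix against the one export expects. A sketch of that check (the stand-in dict below mimics the Lightning checkpoint layout; in the real case `ckpt` would come from `torch.load("dino_model_epoch001.pth", map_location="cpu")`, and the key names here are hypothetical):

```python
# Stand-in for a Lightning checkpoint: weights live under "state_dict".
# In the real script, replace this literal with:
#   ckpt = torch.load("dino_model_epoch001.pth", map_location="cpu")
ckpt = {
    "state_dict": {
        "model.model.backbone.0.body.patch_embed.weight": None,       # hypothetical
        "model.model.backbone.0.body.blocks.0.attn.qkv.weight": None, # hypothetical
    }
}
state = ckpt.get("state_dict", ckpt)  # fall back if weights are stored flat

# The export error says "model.model.backbone.0.body.conv1.weight" is missing,
# so list what backbone keys actually exist to spot a prefix/architecture mismatch:
backbone_keys = [k for k in state if "backbone" in k]
print(backbone_keys)
```

The `conv1.weight` name in the error looks like a ResNet layer, so a mismatch here might indicate the export path is building a different backbone than the fan_small one that was trained.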
I've included the training logs from inside the docker container and `cat`'ed the spec files so you can confirm they are exactly what I am running.
I've also attached the COCO annotations I am training on, in case they are useful:
train.json (59.7 KB)
training_log.txt (112.0 KB)
Please let me know if you need anything else!