[TAO 5] [Object Detection] Can't export a DINO model after training successfully. Missing Layers?

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
NVIDIA RTX A5000

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
DINO Object Detection with fan_backbone

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Command 'tlt' not found; I am running the commands directly inside the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt container rather than via the TAO launcher.

• Training spec file(If have, please share here)
Training spec (train.yaml)

dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/train/images/
      json_file: /workspace/tao-experiments/object_detection/data/train/train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/val/images/
      json_file: /workspace/tao-experiments/object_detection/data/val/val.json
  num_classes: 2
  batch_size: 2
  workers: 8
  augmentation:
    scales: [ 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800 ]
    input_mean: [ 0.485, 0.456, 0.406 ]
    input_std: [ 0.229, 0.224, 0.225 ]
    horizontal_flip_prob: 0.5
    train_random_resize: [ 400, 500, 600 ]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vfan_hybrid_small/fan_hybrid_small.pth
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10]
    momentum: 0.9
  num_epochs: 2
  precision: fp16
  activation_checkpoint: True

Export spec (export.yaml)

export:
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model.onnx
  on_cpu: True
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

  1. Pull the docker image
    docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

  2. Download the pretrained backbone
    ngc registry model download-version nvidia/tao/pretrained_dino_imagenet:fan_hybrid_small --dest ./models/tiny_train/pretrained_resnet18/

  3. Enter the docker image

sudo docker run -it --runtime=nvidia -e DISPLAY=$DISPLAY -v ./data/tiny_train:/workspace/tao-experiments/object_detection/data -v ./notebooks_repo/specs_in_use:/workspace/tao-experiments/object_detection/specs -v ./models/tiny_train:/workspace/tao-experiments/object_detection/models -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash
  4. Train the model
dino train -e /workspace/tao-experiments/object_detection/specs/_train_.yaml -r /workspace/tao-experiments/object_detection/models/unpruned_resnet18/

(It says it passes, creates checkpoints in ./train/lightning_logs/version_0/checkpoints, and creates model files in ./train.)

However, it complains (I suspect this is the source of the problem?):
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

  5. Export the model
    dino export -e /workspace/tao-experiments/object_detection/specs/_export_.yaml export.checkpoint=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth export.onnx_file=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch021.onnx results_dir=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/

It complains a lot about keys missing from the state dict:
Error(s) in loading state_dict for DINOPlModel:
Missing key(s) in state_dict: "model.model.backbone.0.body.conv1.weight",
and then it fails.
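
In case it helps with narrowing this down, the top-level keys of that checkpoint can be listed inside the container with something along these lines (this assumes the usual PyTorch Lightning checkpoint layout with a top-level 'state_dict' entry; adjust if the layout differs):

# run from .../unpruned_resnet18/train/ inside the container
python3 -c "import torch; ckpt = torch.load('dino_model_epoch001.pth', map_location='cpu'); print(list(ckpt.keys())); sd = ckpt.get('state_dict', ckpt); print(list(sd.keys())[:20])"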

I've included the training logs from inside the docker container and cat'ed the spec files so you can confirm they are exactly what I am running.

I've also attached the COCO annotations I am running on, in case that is useful.

train.json (59.7 KB)
training_log.txt (112.0 KB)

Please let me know if you need anything else!

Thanks for the detailed and clear info.
To narrow down, could you please try another pretrained model?

!ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_hybrid_tiny_nvimagenet

It is mentioned in the DINO notebook: https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/dino.ipynb

Gave this a go.
No dice, I'm afraid.

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_hybrid_tiny_nvimagenet --dest ./models/tiny_train/pretrained_resnet18/
model:
#  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vresnet18/resnet18.hdf5
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
  backbone: fan_tiny
  train_backbone: True

^^ Put that in the training spec following the notebook

Whilst training I still get

Loaded pretrained weights from /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias'])

And the export still fails with what appears to be the same error.

Training log attached. Thanks for the quick response.
training_log_new_base_model.txt (104.6 KB)

Please check the pretrained model below, which has backbone: fan_small. Thanks.
$ wget --content-disposition 'https://api.ngc.nvidia.com/v2/models/org/nvidia/team/tao/pretrained_dino_nvimagenet/fan_small_hybrid_nvimagenet/files?redirect=true&path=fan_small_hybrid_nvimagenet.pth' -O fan_small_hybrid_nvimagenet.pth


Tried this

docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

wget --content-disposition 'https://api.ngc.nvidia.com/v2/models/org/nvidia/team/tao/pretrained_dino_nvimagenet/fan_small_hybrid_nvimagenet/files?redirect=true&path=fan_small_hybrid_nvimagenet.pth' -O fan_small_hybrid_nvimagenet.pth


Still get

Loaded pretrained weights from /workspace/tao-experiments/object_detection/models/pretrained_resnet18/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

Before training starts

And still get the export error afterwards.

training_log_fan_small.txt (114.3 KB)

root@124b97f30e42:/opt/nvidia/tools# cat /workspace/tao-experiments/object_detection/specs/_export_.yaml
export:
  batch_size: -1
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  input_channel: 3
  input_height: 544
  input_width: 960
  on_cpu: true
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth.onnx
  opset_version: 12
root@124b97f30e42:/opt/nvidia/tools# cat /workspace/tao-experiments/object_detection/specs/_train_.yaml
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/train/images/
      json_file: /workspace/tao-experiments/object_detection/data/train/train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/val/images/
      json_file: /workspace/tao-experiments/object_detection/data/val/val.json
  num_classes: 2
  batch_size: 2
  workers: 8
  augmentation:
    scales: [ 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800 ]
    input_mean: [ 0.485, 0.456, 0.406 ]
    input_std: [ 0.229, 0.224, 0.225 ]
    horizontal_flip_prob: 0.5
    train_random_resize: [ 400, 500, 600 ]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
#  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vresnet18/resnet18.hdf5
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/fan_small_hybrid_nvimagenet.pth
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10]
    momentum: 0.9
  num_epochs: 2
  precision: fp16
  activation_checkpoint: True

Thanks for the info. I will try to reproduce on my side.
If you have time, you can run the default notebook to check whether the issue still reproduces there.


Any luck?

Please refer to https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/specs/export.yaml; the "dataset" and "model" sections also need to be set in the export spec.
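
For illustration, a minimal export spec along those lines might look like the following (this mirrors the paths and model settings used elsewhere in this thread; the linked export.yaml shows the exact required fields):

dataset:
  num_classes: 2
  batch_size: -1
model:
  backbone: fan_small
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
export:
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model.onnx
  on_cpu: true
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544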

Ok thank you.

I can now train, export, evaluate, infer, etc so thanks for your help, that’s good progress!

However if I try to create an engine file from the onnx file I get…

[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:748: While parsing node number 302 [Range -> "/model/backbone/backbone.0/body/pos_embed/Range_output_0"]:
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:749: --- Begin node ---
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:750: input: "/model/backbone/backbone.0/body/pos_embed/Cast_output_0"
input: "/model/backbone/backbone.0/body/pos_embed/Cast_1_output_0"
input: "/model/backbone/backbone.0/body/pos_embed/Constant_1_output_0"
output: "/model/backbone/backbone.0/body/pos_embed/Range_output_0"
name: "/model/backbone/backbone.0/body/pos_embed/Range"
op_type: "Range"

[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:751: --- End node ---
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:753: ERROR: ModelImporter.cpp:162 In function parseGraph:
[6] Invalid Node - /model/backbone/backbone.0/body/pos_embed/Range
All inputs to range should be initializers.
[09/19/2023-08:18:19] [E] Failed to parse onnx file
[09/19/2023-08:18:19] [E] Parsing model failed
[09/19/2023-08:18:19] [I] Finish parsing network model
&&&& FAILED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=/models/dino_military.onnx --fp16 --buildOnly --saveEngine=/models/dino_military_orin.engine
[09/19/2023-08:18:19] [E] Failed to create engine from model or file.
[09/19/2023-08:18:19] [E] Engine set up failed

The TensorRT version is 8.4 and the CUDA version is 11.4 on the inference device. Do I need to upgrade?

I did get warnings when I exported the model (attached below)
export errors.txt (80.4 KB)

Could you please share the onnx file?

Also, please share the command line you use to generate the TensorRT engine.

/usr/src/tensorrt/bin/trtexec --onnx=/data/models/dino_military.onnx --fp16 --saveEngine=/data/models/dino_military_orin.engine

^^ command line

BitWarden Send ONNX model

(Although the name of the onnx model does not match the name of the file I linked, they are the same file. I just rename it when I copy it over to the inference device.)

Thanks. I can download the onnx file now.
Also, on which device did you generate the TensorRT engine? It seems to be a Jetson Orin, right?

Please try setting a larger opset_version when you export the onnx file, then try generating the engine on the Orin again.
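
One more note: because the export spec uses batch_size: -1, the onnx file has a dynamic batch dimension, and trtexec will default that dimension to 1 unless explicit shape ranges are given. A sketch for batch sizes up to 8, assuming the network input tensor is named inputs (please confirm the actual name with Netron or polygraphy inspect model):

/usr/src/tensorrt/bin/trtexec --onnx=/data/models/dino_military.onnx --fp16 \
    --minShapes=inputs:1x3x544x960 --optShapes=inputs:4x3x544x960 --maxShapes=inputs:8x3x544x960 \
    --saveEngine=/data/models/dino_military_orin.engine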

It's over 100 MB, so I can't upload it here.

It is a Jetson Orin.

I will try opset 19 and get back to you shortly, unless you have a better suggestion for the opset number.

So I get the same warnings in the onnx export. I used ONNX opset version 13, as I believe that's the highest you support.

Export Spec

dataset:
  batch_size: -1
  num_classes: 2
export:
  gpu_id: 0
  input_height: 544
  input_width: 960
  on_cpu: false
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth.onnx
  opset_version: 13
model:
  backbone: fan_small
  dec_layers: 6
  dim_feedforward: 2048
  dropout_ratio: 0.0
  enc_layers: 6
  num_feature_levels: 4
  num_queries: 300
  num_select: 100
/usr/local/lib/python3.8/dist-packages/third_party/onnx/utils.py:703: UserWarning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/passes/onnx/shape_type_inference.cpp:432.)
  _C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.8/dist-packages/third_party/onnx/utils.py:1194: UserWarning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/passes/onnx/shape_type_inference.cpp:432.)

Starting DINO export.txt (75.1 KB)

This error goes away if I set on_cpu: true (I get fewer errors overall), and it then automatically forces the opset version to 16.

But I still get the same error when creating the engine file
DINO Export on CPU.txt (94.2 KB)

I can create an engine file on a different device with TensorRT 8.5.3.
I will upgrade the inference machine, and hopefully that will be the end of this saga.

Thanks for the info. The warnings during onnx file generation can be ignored.
I can generate a TensorRT engine successfully with trtexec inside the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt docker on a dGPU machine; the TensorRT version in this docker is 8.5.3.
For the Orin, yes, please use TensorRT 8.5.3.
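
To double-check what is currently installed on the Orin, something like the following should list the TensorRT (libnvinfer) and CUDA packages as well as the L4T release (package names can vary slightly between JetPack versions):

dpkg -l | grep -i -E 'nvinfer|cuda-toolkit'
cat /etc/nv_tegra_release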

I am really struggling to get an Orin to play nicely with anything other than the CUDA version that ships with the latest JetPack, which is CUDA 11.4.

Is there a way of getting TensorRT 8.5.3 without upgrading to CUDA 11.8 on an Orin device?

Please update to CUDA 11.8.

You can do this inside the nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel docker to avoid unexpected issues.

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel /bin/bash

Update CUDA from 11.4 to 11.8:

$ apt-get --purge remove "cuda" "cublas" "cufft" "curand" "cusolver" "cusparse"
$ apt-get update
$ apt-get install vim wget
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/arm64/cuda-ubuntu2004.pin
$ mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ cp /var/cuda-tegra-repo-ubuntu2004-11-8-local/cuda-tegra-95320BC3-keyring.gpg /usr/share/keyrings/
$ dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ apt-get update
$ rm cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ apt-get -y install cuda-toolkit-11-8

More info is in the CUDA Installation Guide for Linux.
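
After the toolkit install, a quick sanity check (assuming the default install prefix /usr/local/cuda-11.8):

$ export PATH=/usr/local/cuda-11.8/bin:$PATH
$ nvcc --version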

Sorry, I got moved onto other things.

I can create an engine file!
I haven't been able to test it yet, as I keep getting the error below from DeepStream 6.3 (JetPack 5.1.2).
But this seems like a DeepStream problem, not a TAO problem.

Feel free to help if you have an idea, but I'll mark the solution.

Thanks for everything

DeepStream problem:

 Failed to load config file: No such file or directory
deepstreamjetson_1  | ** ERROR: <gst_nvinfer_parse_config_file:1319>: failed
deepstreamjetson_1  | libv4l2: error getting capabilities: Inappropriate ioctl for device
deepstreamjetson_1  | ** ERROR: <main:716>: Failed to set pipeline to PAUSED
deepstreamjetson_1  | Quitting
deepstreamjetson_1  | nvstreammux: Successfully handled EOS for source_id=0
deepstreamjetson_1  | ERROR from sink_sub_bin_encoder2: Error getting capabilities for device '/dev/nvhost-msenc': It isn't a v4l2 driver. Check if it is a v4l1 driver.
