[TAO 5] [Object Detection] Can't export a DINO model after training successfully. Missing Layers?

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
NVIDIA RTX A5000

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
DINO Object Detection with fan_backbone

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Command 'tlt' not found; I am running the commands directly inside the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt container rather than via the TAO launcher.

• Training spec file(If have, please share here)
Training spec (train.yaml)

dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/train/images/
      json_file: /workspace/tao-experiments/object_detection/data/train/train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/val/images/
      json_file: /workspace/tao-experiments/object_detection/data/val/val.json
  num_classes: 2
  batch_size: 2
  workers: 8
  augmentation:
    scales: [ 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800 ]
    input_mean: [ 0.485, 0.456, 0.406 ]
    input_std: [ 0.229, 0.224, 0.225 ]
    horizontal_flip_prob: 0.5
    train_random_resize: [ 400, 500, 600 ]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vfan_hybrid_small/fan_hybrid_small.pth
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10]
    momentum: 0.9
  num_epochs: 2
  precision: fp16
  activation_checkpoint: True

Export spec (export.yaml)

export:
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model.onnx
  on_cpu: True
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

  1. Pull the docker image
    docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

  2. Download the pretrained backbone
    ngc registry model download-version nvidia/tao/pretrained_dino_imagenet:fan_hybrid_small --dest ./models/tiny_train/pretrained_resnet18/

  3. Enter the docker image

sudo docker run -it --runtime=nvidia -e DISPLAY=$DISPLAY -v ./data/tiny_train:/workspace/tao-experiments/object_detection/data -v ./notebooks_repo/specs_in_use:/workspace/tao-experiments/object_detection/specs -v ./models/tiny_train:/workspace/tao-experiments/object_detection/models -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash
  4. Train the model
dino train -e /workspace/tao-experiments/object_detection/specs/_train_.yaml -r /workspace/tao-experiments/object_detection/models/unpruned_resnet18/

(It says it passes, creates checkpoints in ./train/lightning_logs/version_0/checkpoints, and creates model files in ./train.)

However, it complains (I suspect this is the source of the problem?):
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

  5. Export the model
    dino export -e /workspace/tao-experiments/object_detection/specs/_export_.yaml export.checkpoint=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth export.onnx_file=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch021.onnx results_dir=/workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/

It complains a lot about keys missing from the state dict:
Error(s) in loading state_dict for DINOPlModel:
Missing key(s) in state_dict: "model.model.backbone.0.body.conv1.weight",
and then it fails.
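
In case it helps with narrowing this down, the top-level keys of that checkpoint can be listed inside the container with something along these lines (this assumes the usual PyTorch Lightning checkpoint layout with a top-level 'state_dict' entry; adjust if the layout differs):

# run from .../unpruned_resnet18/train/ inside the container
python3 -c "import torch; ckpt = torch.load('dino_model_epoch001.pth', map_location='cpu'); print(list(ckpt.keys())); sd = ckpt.get('state_dict', ckpt); print(list(sd.keys())[:20])"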

I've included the training logs from inside the docker container and cat'ed the spec files so you can confirm they are exactly what I am running.

I've also attached the COCO annotations I am running on, in case that is useful.

train.json (59.7 KB)
training_log.txt (112.0 KB)

Please let me know if you need anything else!

Thanks for the detailed and clear info.
To narrow down, could you please try another pretrained model?

!ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_hybrid_tiny_nvimagenet

It is mentioned in the DINO notebook: https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/dino.ipynb

Gave this a go.
No dice, I'm afraid.

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_hybrid_tiny_nvimagenet --dest ./models/tiny_train/pretrained_resnet18/
model:
#  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vresnet18/resnet18.hdf5
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
  backbone: fan_tiny
  train_backbone: True

^^ Put that in the training spec following the notebook

Whilst training I still get

Loaded pretrained weights from /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias'])

And the export still fails with what appears to be the same error.

Training log attached. Thanks for the quick response.
training_log_new_base_model.txt (104.6 KB)

Please check the pretrained model below, which has backbone: fan_small. Thanks.
$ wget --content-disposition 'https://api.ngc.nvidia.com/v2/models/org/nvidia/team/tao/pretrained_dino_nvimagenet/fan_small_hybrid_nvimagenet/files?redirect=true&path=fan_small_hybrid_nvimagenet.pth' -O fan_small_hybrid_nvimagenet.pth


Tried this

docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

wget --content-disposition 'https://api.ngc.nvidia.com/v2/models/org/nvidia/team/tao/pretrained_dino_nvimagenet/fan_small_hybrid_nvimagenet/files?redirect=true&path=fan_small_hybrid_nvimagenet.pth' -O fan_small_hybrid_nvimagenet.pth


Still get

Loaded pretrained weights from /workspace/tao-experiments/object_detection/models/pretrained_resnet18/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

Before training starts

And still get the export error afterwards.

training_log_fan_small.txt (114.3 KB)

root@124b97f30e42:/opt/nvidia/tools# cat /workspace/tao-experiments/object_detection/specs/_export_.yaml
export:
  batch_size: -1
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  input_channel: 3
  input_height: 544
  input_width: 960
  on_cpu: true
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth.onnx
  opset_version: 12
root@124b97f30e42:/opt/nvidia/tools# cat /workspace/tao-experiments/object_detection/specs/_train_.yaml
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/train/images/
      json_file: /workspace/tao-experiments/object_detection/data/train/train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/object_detection/data/val/images/
      json_file: /workspace/tao-experiments/object_detection/data/val/val.json
  num_classes: 2
  batch_size: 2
  workers: 8
  augmentation:
    scales: [ 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800 ]
    input_mean: [ 0.485, 0.456, 0.406 ]
    input_std: [ 0.229, 0.224, 0.225 ]
    horizontal_flip_prob: 0.5
    train_random_resize: [ 400, 500, 600 ]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
#  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/pretrained_dino_imagenet_vresnet18/resnet18.hdf5
  pretrained_backbone_path: /workspace/tao-experiments/object_detection/models/pretrained_resnet18/fan_small_hybrid_nvimagenet.pth
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10]
    momentum: 0.9
  num_epochs: 2
  precision: fp16
  activation_checkpoint: True

Thanks for the info. I will try to reproduce on my side.
If you have time, you can run the default notebook to check whether the issue still reproduces there.


Any luck?

Please refer to https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/specs/export.yaml; the "dataset" and "model" sections also need to be set in the export spec.
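
For illustration, a minimal export spec along those lines might look like the following (this mirrors the paths and model settings used elsewhere in this thread; the linked export.yaml shows the exact required fields):

dataset:
  num_classes: 2
  batch_size: -1
model:
  backbone: fan_small
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
export:
  checkpoint: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model.onnx
  on_cpu: true
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544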

Ok thank you.

I can now train, export, evaluate, infer, etc so thanks for your help, that’s good progress!

However if I try to create an engine file from the onnx file I get…

[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:748: While parsing node number 302 [Range -> "/model/backbone/backbone.0/body/pos_embed/Range_output_0"]:
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:749: --- Begin node ---
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:750: input: "/model/backbone/backbone.0/body/pos_embed/Cast_output_0"
input: "/model/backbone/backbone.0/body/pos_embed/Cast_1_output_0"
input: "/model/backbone/backbone.0/body/pos_embed/Constant_1_output_0"
output: "/model/backbone/backbone.0/body/pos_embed/Range_output_0"
name: "/model/backbone/backbone.0/body/pos_embed/Range"
op_type: "Range"

[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:751: --- End node ---
[09/19/2023-08:18:19] [E] [TRT] ModelImporter.cpp:753: ERROR: ModelImporter.cpp:162 In function parseGraph:
[6] Invalid Node - /model/backbone/backbone.0/body/pos_embed/Range
All inputs to range should be initializers.
[09/19/2023-08:18:19] [E] Failed to parse onnx file
[09/19/2023-08:18:19] [E] Parsing model failed
[09/19/2023-08:18:19] [I] Finish parsing network model
&&&& FAILED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=/models/dino_military.onnx --fp16 --buildOnly --saveEngine=/models/dino_military_orin.engine
[09/19/2023-08:18:19] [E] Failed to create engine from model or file.
[09/19/2023-08:18:19] [E] Engine set up failed

The TensorRT version is 8.4 and the CUDA version is 11.4 on the inference device. Do I need to upgrade?

I did get warnings when I exported the model (attached below)
export errors.txt (80.4 KB)

Could you please share the onnx file?

Also, please share the command line you use to generate the TensorRT engine.

/usr/src/tensorrt/bin/trtexec --onnx=/data/models/dino_military.onnx --fp16 --saveEngine=/data/models/dino_military_orin.engine

^^ command line

BitWarden Send ONNX model

(Although the name of the onnx model does not match the name of the file I linked, they are the same file. I just rename it when I copy it over to the inference device.)

Thanks. I can download the onnx file now.
Also, on which device did you generate the TensorRT engine? It seems to be a Jetson Orin, right?

Please try setting a larger opset_version when you export the onnx file, then try generating the engine on the Orin again.
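
One more note: because the export spec uses batch_size: -1, the onnx file has a dynamic batch dimension, and trtexec will default that dimension to 1 unless explicit shape ranges are given. A sketch for batch sizes up to 8, assuming the network input tensor is named inputs (please confirm the actual name with Netron or polygraphy inspect model):

/usr/src/tensorrt/bin/trtexec --onnx=/data/models/dino_military.onnx --fp16 \
    --minShapes=inputs:1x3x544x960 --optShapes=inputs:4x3x544x960 --maxShapes=inputs:8x3x544x960 \
    --saveEngine=/data/models/dino_military_orin.engine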

It's over 100 MB, so I can't upload it here.

It is a Jetson Orin.

I will try opset 19 and get back to you shortly, unless you have a better suggestion for the opset number.

So I get the same warnings in the onnx export. I used ONNX opset version 13, as I believe that's the highest you support.

Export Spec

dataset:
  batch_size: -1
  num_classes: 2
export:
  gpu_id: 0
  input_height: 544
  input_width: 960
  on_cpu: false
  onnx_file: /workspace/tao-experiments/object_detection/models/unpruned_resnet18/train/dino_model_epoch001.pth.onnx
  opset_version: 13
model:
  backbone: fan_small
  dec_layers: 6
  dim_feedforward: 2048
  dropout_ratio: 0.0
  enc_layers: 6
  num_feature_levels: 4
  num_queries: 300
  num_select: 100
/usr/local/lib/python3.8/dist-packages/third_party/onnx/utils.py:703: UserWarning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/passes/onnx/shape_type_inference.cpp:432.)
  _C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.8/dist-packages/third_party/onnx/utils.py:1194: UserWarning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select) (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/passes/onnx/shape_type_inference.cpp:432.)

Starting DINO export.txt (75.1 KB)

This error goes away if I set on_cpu: true (I get fewer errors overall), and it then automatically forces the opset version to 16.

But I still get the same error when creating the engine file
DINO Export on CPU.txt (94.2 KB)

I can create an engine file on a different device with TensorRT 8.5.3.
I will upgrade the inference machine, and hopefully that will be the end of this saga.

Thanks for the info. The warnings during onnx file generation can be ignored.
I can generate a TensorRT engine successfully with trtexec inside the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt docker on a dGPU machine; the TensorRT version in this docker is 8.5.3.
For the Orin, yes, please use TensorRT 8.5.3.
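
To double-check what is currently installed on the Orin, something like the following should list the TensorRT (libnvinfer) and CUDA packages as well as the L4T release (package names can vary slightly between JetPack versions):

dpkg -l | grep -i -E 'nvinfer|cuda-toolkit'
cat /etc/nv_tegra_release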

I am really struggling to get an Orin to play nicely with anything other than the CUDA version that ships with the latest JetPack, which is CUDA 11.4.

Is there a way of getting TensorRT 8.5.3 without upgrading to CUDA 11.8 on an Orin device?

Please update to CUDA 11.8.

You can do this inside the nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel docker to avoid unexpected issues.

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel /bin/bash

Update CUDA from 11.4 to 11.8:

$ apt-get --purge remove "cuda" "cublas" "cufft" "curand" "cusolver" "cusparse"
$ apt-get update
$ apt-get install vim wget
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/arm64/cuda-ubuntu2004.pin
$ mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ cp /var/cuda-tegra-repo-ubuntu2004-11-8-local/cuda-tegra-95320BC3-keyring.gpg /usr/share/keyrings/
$ dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ apt-get update
$ rm cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
$ apt-get -y install cuda-toolkit-11-8

More info is in the CUDA Installation Guide for Linux.
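
After the toolkit install, a quick sanity check (assuming the default install prefix /usr/local/cuda-11.8):

$ export PATH=/usr/local/cuda-11.8/bin:$PATH
$ nvcc --version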

Sorry, I got moved onto other things.

I can create an engine file!
I haven't been able to test it yet, as I keep getting the error below from DeepStream 6.3 (JetPack 5.1.2).
But this seems like a DeepStream problem, not a TAO problem.

Feel free to help if you have an idea, but I'll mark the solution.

Thanks for everything

DeepStream problem:

 Failed to load config file: No such file or directory
deepstreamjetson_1  | ** ERROR: <gst_nvinfer_parse_config_file:1319>: failed
deepstreamjetson_1  | libv4l2: error getting capabilities: Inappropriate ioctl for device
deepstreamjetson_1  | ** ERROR: <main:716>: Failed to set pipeline to PAUSED
deepstreamjetson_1  | Quitting
deepstreamjetson_1  | nvstreammux: Successfully handled EOS for source_id=0
deepstreamjetson_1  | ERROR from sink_sub_bin_encoder2: Error getting capabilities for device '/dev/nvhost-msenc': It isn't a v4l2 driver. Check if it is a v4l1 driver.
