TLT 3.0: error loading the VehicleTypeNet pretrained model when retraining the classification net

I downloaded VehicleTypeNet manually with:
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_vehicletypenet/versions/unpruned_v1.0/zip -O tlt_vehicletypenet_unpruned_v1.0.zip
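The downloaded archive then has to be unpacked before the .tlt file can be referenced in the spec. A minimal sketch of that step (the `extract_model_archive` helper and the destination path are illustrative assumptions, not part of TLT):

```python
import os
import zipfile

def extract_model_archive(zip_path, dest_dir):
    """Extract a downloaded NGC model archive and return its file list."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

# Illustrative usage (paths are assumptions, not the exact ones from this post):
# files = extract_model_archive(
#     "tlt_vehicletypenet_unpruned_v1.0.zip",
#     "pretrained_resnet18/tlt_vehicletypenet_unpruned_v1.0")
```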

and intended to retrain the model on my own dataset. However, an error occurred when I executed the following command in a Jupyter notebook:
!tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY
The detailed output is:

Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 260, in decode_to_keras
File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
f = h5dict(filepath, 'r')
File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in init
self.data = h5py.File(path, mode=mode)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 454, in
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 449, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 358, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/utils/helper.py", line 121, in model_io
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 263, in decode_to_keras
OSError: Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect.

Traceback (most recent call last):
File "/usr/local/bin/classification", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/entrypoint/makenet.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-05-21 13:25:57,846 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It looks like the engine is trying to load an HDF5-format model, but I loaded the pretrained model in .tlt format. Is this error caused by an incompatible model file? By the way, I am sure that loading a .tlt model works fine in the object detection task!
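For context on the first traceback: the launcher decrypts the .tlt file with $KEY and then hands the result to Keras/h5py, which expects a valid HDF5 file. When the key is wrong, the decrypted bytes do not begin with the standard 8-byte HDF5 signature, which is exactly why h5py reports "file signature not found". A small sketch of that signature check (the `looks_like_hdf5` helper is hypothetical, not part of TLT):

```python
# Standard 8-byte magic signature at the start of every HDF5 file.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 magic bytes."""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_MAGIC
```

An encrypted .tlt file (or a file decrypted with the wrong key) fails this check, so h5py refuses to open it.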

Here is my full spec file:
model_config {
  arch: "resnet"
  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}

train_config {
  train_dataset_path: "/workspace/tlt-experiments/data/split/train"
  val_dataset_path: "/workspace/tlt-experiments/data/split/val"
  pretrained_model_path: "/workspace/tlt-experiments/classification/pretrained_resnet18/tlt_vehicletypenet_unpruned_v1.0/resnet18_vehicletypenet.tlt"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
  lr_config {
    step {
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }
  }
}

eval_config {
  eval_dataset_path: "/workspace/tlt-experiments/data/split/test"
  model_path: "/workspace/tlt-experiments/classification/output/weights/resnet18_vehicletypenet.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
  enable_center_crop: True
}

@Morganh,
Moreover, I slightly modified the spec file based on the following lines, which I copied from NVIDIA NGC:

model_config {
  arch: "resnet"
  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}

training_config{
  train_dataset_path: "/path/to/your/train/data"
  val_dataset_path: "/path/to/your/val/data"
  pretrained_model_path: "/path/to/your/pretrained/model"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
}

By the way, based on my experiments with the TLT 3.0 engine, "training_config" in the NGC sample should be changed to "train_config".
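A quick sanity check along those lines (the `check_spec_section` helper is hypothetical, just to illustrate the rename, not part of TLT):

```python
def check_spec_section(spec_text):
    """Warn if a spec still uses the NGC sample's 'training_config' name."""
    if "training_config" in spec_text:
        return "rename 'training_config' to 'train_config' for TLT 3.0"
    if "train_config" in spec_text:
        return "ok"
    return "no training section found"
```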

What is the $KEY?

import os

%env KEY=tlt_encode
%env NUM_GPUS=1
%env USER_EXPERIMENT_DIR=/workspace/tlt-experiments/classification

I restarted the Jupyter notebook and it works! Thanks, @Morganh.
