TLT 3.0: error loading the VehicleTypeNet pretrained model when retraining the classification net

I downloaded VehicleTypeNet manually with:
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_vehicletypenet/versions/unpruned_v1.0/zip -O tlt_vehicletypenet_unpruned_v1.0.zip
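The downloaded archive then has to be unpacked before the .tlt file can be referenced in the spec. A minimal sketch of that step (the `extract_model_archive` helper and the destination path are illustrative assumptions, not part of TLT):

```python
import os
import zipfile

def extract_model_archive(zip_path, dest_dir):
    """Extract a downloaded NGC model archive and return its file list."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

# Illustrative usage (paths are assumptions, not the exact ones from this post):
# files = extract_model_archive(
#     "tlt_vehicletypenet_unpruned_v1.0.zip",
#     "pretrained_resnet18/tlt_vehicletypenet_unpruned_v1.0")
```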

and intended to retrain the model on my own dataset. However, an error occurred when I executed the following command in a Jupyter notebook:
!tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY
The detailed output is:

Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 260, in decode_to_keras
File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
f = h5dict(filepath, 'r')
File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in init
self.data = h5py.File(path, mode=mode)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 454, in
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 449, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 358, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/utils/helper.py", line 121, in model_io
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 263, in decode_to_keras
OSError: Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect.

Traceback (most recent call last):
File "/usr/local/bin/classification", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/entrypoint/makenet.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-05-21 13:25:57,846 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It looks like the engine is trying to load an HDF5-format model, but I loaded the pretrained model in .tlt format. Is this error caused by an incompatible model file? By the way, I am sure that loading a .tlt model works fine in the object detection task!
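For context on the first traceback: the launcher decrypts the .tlt file with $KEY and then hands the result to Keras/h5py, which expects a valid HDF5 file. When the key is wrong, the decrypted bytes do not begin with the standard 8-byte HDF5 signature, which is exactly why h5py reports "file signature not found". A small sketch of that signature check (the `looks_like_hdf5` helper is hypothetical, not part of TLT):

```python
# Standard 8-byte magic signature at the start of every HDF5 file.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 magic bytes."""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_MAGIC
```

An encrypted .tlt file (or a file decrypted with the wrong key) fails this check, so h5py refuses to open it.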

Here is my full spec file:
model_config {
  arch: "resnet"
  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}

train_config {
  train_dataset_path: "/workspace/tlt-experiments/data/split/train"
  val_dataset_path: "/workspace/tlt-experiments/data/split/val"
  pretrained_model_path: "/workspace/tlt-experiments/classification/pretrained_resnet18/tlt_vehicletypenet_unpruned_v1.0/resnet18_vehicletypenet.tlt"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
  lr_config {
    step {
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }
  }
}

eval_config {
  eval_dataset_path: "/workspace/tlt-experiments/data/split/test"
  model_path: "/workspace/tlt-experiments/classification/output/weights/resnet18_vehicletypenet.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
  enable_center_crop: True
}

@Morganh,
Moreover, I slightly modified the spec file based on the following lines, which I copied from NVIDIA NGC:

model_config {
  arch: "resnet"
  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}

training_config{
  train_dataset_path: "/path/to/your/train/data"
  val_dataset_path: "/path/to/your/val/data"
  pretrained_model_path: "/path/to/your/pretrained/model"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
}

By the way, based on my experiments with the TLT 3.0 engine, "training_config" in the NGC sample should be changed to "train_config".
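A quick sanity check along those lines (the `check_spec_section` helper is hypothetical, just to illustrate the rename, not part of TLT):

```python
def check_spec_section(spec_text):
    """Warn if a spec still uses the NGC sample's 'training_config' name."""
    if "training_config" in spec_text:
        return "rename 'training_config' to 'train_config' for TLT 3.0"
    if "train_config" in spec_text:
        return "ok"
    return "no training section found"
```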

What is the $KEY?

import os

%env KEY=tlt_encode
%env NUM_GPUS=1
%env USER_EXPERIMENT_DIR=/workspace/tlt-experiments/classification

I restarted the Jupyter notebook and it works! Thanks, @Morganh.
