Tao-converter [ERROR] Failed to parse the model, please check the encoding key to make sure its correct

" ValueError: Invalid model: /workspace/tao-experiments/detectnet_v2/pretrained_resnet50/pretrained_detectnet_v2_vresnet50/resnet50.hdf5, please check the key used to load the model

2023-06-03 18:09:25,239 [INFO] tlt.components.docker_handler.docker_handler: Stopping container."

I suggest you run the experiment below:

  1. train your model again for 1 epoch
  2. export to a .etlt model
  3. run tao-converter again.
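The three steps can be sketched as shell commands. This is a hypothetical sketch, not the notebook's exact invocations: the spec and result paths are placeholders, and only the general TAO flags (-e spec, -r results, -k key, -m model, -o output) are assumed. The commands are built as strings here to stress the one thing that matters: the same -k value at every stage.

```shell
# Hypothetical sketch: the same key (-k) must be used at every stage.
# Paths and spec file names are placeholders, not the notebook's real ones.
KEY=123
TRAIN="tao detectnet_v2 train -e specs/train_spec.txt -r results/ -k $KEY"
EXPORT="tao detectnet_v2 export -m results/weights/model.tlt -o results/model.etlt -k $KEY"
CONVERT="tao-converter -k $KEY -e results/model.engine results/model.etlt"
printf '%s\n' "$TRAIN" "$EXPORT" "$CONVERT"
```

A mismatch between the key used at train/export time and the key passed to tao-converter is exactly what produces the "Invalid model … check the key" error.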

Should I be using “$KEY”? (It doesn’t recognise the explicit key.)

You can check which one can be used to export successfully.

What I mentioned above is a quick experiment to verify the process.
You can train/export with any key value. For example,

  1. train your model again for 1 epoch, with key=123
  2. export to a .etlt model, with key=123
  3. run tao-converter again, with key=123

Okay, I will do that and get back, but this raises one other question:

“!tao detectnet_v2 train” is the first time one has to use the “-k” argument in this project, so on what basis is it saying that the model is invalid? The model was downloaded from the NVIDIA NGC registry, and that may have required my NGC API key. Does that mean that resnet50.hdf5 is key-encrypted, and therefore dictates the key (in effect, my NGC API key) that one needs to use for everything else that uses it?

No, the .hdf5 is not key-encrypted.

In tao detectnet_v2 train, you can use your NGC API key or any value.


“ValueError: Invalid model: /workspace/tao-experiments/detectnet_v2/pretrained_resnet50/pretrained_detectnet_v2_vresnet50/resnet50.hdf5, please check the key used to load the model”

what does “check the key” mean? I have used “123”, as you suggested above.
please advise

Where did you download resnet50.hdf5? Please try to download it again.

I carried out the step laid out in the jupyter notebook
"!ngc registry model download-version nvidia/tao/pretrained_detectnet_v2:resnet50
--dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet50"

As a sanity check, to be clear, you would like me to:

  1. sudo rm -r the existing pretrained_detectnet_v2_vresnet50 folder in my LOCAL_EXPERIMENT_DIR

  2. Run the cell "!ngc registry model download-version nvidia/tao/pretrained_detectnet_v2:resnet50
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet50"

and then run the 1 epoch experiment that you describe above

please confirm

Can you try the model below?
wget 'https://api.ngc.nvidia.com/v2/models/nvidia/tao/pretrained_detectnet_v2/versions/resnet50/files/resnet50.hdf5'
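One quick check worth doing first, since a failed download can leave an HTML/JSON error page saved under the .hdf5 name: a genuine HDF5 file begins with the magic bytes \211HDF\r\n\032\n. The helper below (check_model, a hypothetical function, not part of TAO) sketches that check:

```shell
# Verify a downloaded file is non-empty and begins with the HDF5 magic bytes;
# a failed NGC/wget download often leaves an error page instead of a model.
check_model() {
    local f="$1"
    [ -s "$f" ] || { echo "missing or empty: $f"; return 1; }
    if head -c 8 "$f" | grep -q "HDF"; then
        echo "looks like HDF5: $f"
    else
        echo "not an HDF5 file: $f"
        return 1
    fi
}
```

Running it against the downloaded resnet50.hdf5 before training will at least tell you whether the file is structurally an HDF5 file.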

The new model you asked me to download is giving the same result:
" Invalid model: /workspace/tao-experiments/detectnet_v2/pretrained_resnet50/pretrained_detectnet_v2_vresnet50/resnet50.hdf5, please check the key used to load the model"

2023-06-04 17:54:35,041 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I used -k 123, but -k $KEY gives the same result.

Please advise

Could you change to resnet18 and check again? You need to change the spec file as well. Thanks for your time.

wget 'https://api.ngc.nvidia.com/v2/models/nvidia/tao/pretrained_detectnet_v2/versions/resnet18/files/resnet18.hdf5'

From TAO Pretrained DetectNet V2 | NVIDIA NGC

I now have resnet50 working in training, using 123 as -k.
The .hdf5 downloads inside a folder called “pretrained_resnet50”, which downloaded inside an existing “pretrained_resnet50” folder. I also had to make a small change to the “pretrained_model_file” path in the train kitti.txt spec file.
Finally, a number of folder names seem to vary between the documentation and the jupyter notebook, e.g. SPECS_DIR is sometimes referred to as LOCAL_SPEC_DIR, and LOCAL_PROJECT_DIR is sometimes USER_PROJECT_DIR.
I will now complete the original 3-stage experiment and will get back with the result.

Hi Morganh.
I have hit a small problem at the export stage:
I moved an old experiment_dir_final out of my LOCAL_EXPERIMENT_DIR and then ran the jupyter notebook cell.
The first line of code:
“!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_final” creates a new empty folder and then the notebook throws a PermissionError associated with this folder.


The permissions are “peter:peter”, which is the same as most of the other folders and files used in this project.

Do you have any idea what the issue might be here?

Refer to Tao Training failing on creating directory on a standard example - #9 by Morganh

Thank you.

The advice contained therein is to delete the following from ~/.tao_mounts.json:

  ,
  "DockerOptions": {
      "user": "1000:1000"
  }

Following that advice gets rid of the permission issue, but the notebook throws a new error:
"ValueError: Cannot find input file name"
This apparently relates to the -k KEY (see "Can't export the model to int8").
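For reference, the edit can be reproduced mechanically. The sketch below works on a throwaway copy in /tmp (the sample Mounts content is made up for illustration) rather than the real ~/.tao_mounts.json:

```shell
# Create an illustrative copy of the mounts file in /tmp, so nothing real is touched.
cat > /tmp/tao_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/home/peter/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
EOF

# Remove the DockerOptions section (the part the forum post says to delete),
# rewriting the file as valid JSON so no stray comma is left behind.
python3 - <<'EOF'
import json

path = "/tmp/tao_mounts.json"
with open(path) as f:
    cfg = json.load(f)
cfg.pop("DockerOptions", None)
with open(path, "w") as f:
    json.dump(cfg, f, indent=4)
EOF
```

Editing through a JSON parser avoids the easy mistake of deleting the block but leaving the trailing comma, which would make the file unparseable.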

As you know from the above, I am using the key “123” as suggested, and it is the same key that was used to train a 1-epoch test for this ‘quick experiment’, which is proving to be anything but quick.

Please advise

Not sure if there is something mismatched in the ~/.tao_mounts.json file.

So, I suggest you run the experiment again in a new terminal instead of the notebook.

$ docker run --runtime=nvidia -it --rm -v /your/local/data/path:/docker/data/path nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, inside the docker, run training for one epoch. All the paths and the key are set explicitly.
# detectnet_v2 train xxx

Also,
# detectnet_v2 export xxx

This has indeed downloaded the latest version of tao-toolkit; however, I am not sure where it downloaded to.
When “detectnet_v2 train” is run from inside the docker I get:


It appears that the syntax for “detectnet_v2 train” isn’t fully recognised inside the docker and it seems to be suggesting that I need to include the task that I want performed. Do I add a flag for “train”, despite explicitly using “train” after “detectnet_v2”?
If so, what is the syntax that I should use?
Thank you

Please try again, typing the command on one line instead of across multiple lines with \.
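A continuation is only honoured when \ is the very last character on the line; a trailing space, or a paste that drops the backslashes, splits one command into several. The sketch below uses a hypothetical stand-in function that just counts its arguments, to show a correctly continued command arriving as a single command:

```shell
# Hypothetical stand-in for detectnet_v2: just reports how many arguments arrived.
detect() { echo "received $# arguments"; }

# Correct continuation: "\" is the last character on each line, so this is ONE command.
detect train \
  -e spec.txt \
  -r results/ \
  -k 123
# prints: received 7 arguments
```

If a space sneaks in after a backslash, the shell instead runs `detect train` on its own and then tries to execute `-e spec.txt` as a separate command, which fails with an unrelated-looking error.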