Error in TAO Toolkit while training

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) 3090
• Network Type ActionRecognitionNet
• TLT Version nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
• Training spec file (if any, please share here)
train_rgb_3d_finetune.yaml (761 Bytes)

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

I am trying to train ActionRecognitionNet inside the TAO Toolkit container by following the NVIDIA blog for ActionRecognitionNet.

I started the container using the following command on my personal machine:

docker run --name fan-tlt --runtime=nvidia -it -v /var/run/docker.sock:/var/run/docker.sock -v /media/userdata/fanyl/tlt/:/home -p 8888:8888 -w /home nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash

Inside this container, I was able to follow the Jupyter notebook mentioned in the blog up to the training step. When I run the following command:

print("Train RGB only model with PTM")

!tao action_recognition train \
    -e $SPECS_DIR/train_rgb_3d_finetune.yaml \
    -r $RESULTS_DIR/rgb_3d_ptm \
    -k $KEY \
    model_config.rgb_pretrained_model_path=$RESULTS_DIR/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt \
    model_config.rgb_pretrained_num_classes=5

I am getting this error:

Train RGB only model with PTM
2022-05-20 11:58:15,350 [INFO] root: Registry: ['nvcr.io']
2022-05-20 11:58:15,418 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
2022-05-20 11:58:15,541 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
ERROR: The indicated experiment spec file /home/tlt-experiments/action_recognition_net/host/specs/train_rgb_3d_finetune.yaml doesn't exist!
2022-05-20 11:58:18,148 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I have verified that the file path exists. Why does this error occur?

Since you are running the command "!tao action_recognition train xxx" in a notebook, the tao-launcher is used. You need to set up a correct ~/.tao_mounts.json to map the files.

I see that you are starting the TAO docker as below:
docker run --name fan-tlt --runtime=nvidia -it -v /var/run/docker.sock:/var/run/docker.sock -v /media/userdata/fanyl/tlt/:/home -p 8888:8888 -w /home nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash

You can also run training directly, without the tao-launcher and the Jupyter notebook, i.e.,
# action_recognition train xxx
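Based on the mounts snippet used later in this thread, a minimal ~/.tao_mounts.json might look like the file written below. This is only a sketch: the host paths are hypothetical placeholders; each "source" must be a directory that actually exists on the host, and each "destination" is the path the training command will see inside the docker.

```python
import json
import os

# Hypothetical host directories -- replace these with real paths on your machine.
host_specs = "/media/userdata/fanyl/tlt/specs"
host_results = "/media/userdata/fanyl/tlt/results"

tao_mounts = {
    "Mounts": [
        # Each entry maps a host path (source) to an in-docker path (destination).
        {"source": host_specs, "destination": "/specs"},
        {"source": host_results, "destination": "/results"},
    ],
    "DockerOptions": {
        "shm_size": "16G"
    },
}

# The tao-launcher reads this file from the home directory of the user
# who runs the `tao` command on the host.
with open(os.path.expanduser("~/.tao_mounts.json"), "w") as f:
    json.dump(tao_mounts, f, indent=4)
```

With this mapping, a spec stored at /media/userdata/fanyl/tlt/specs/train_rgb_3d_finetune.yaml on the host would be passed to the launcher as -e /specs/train_rgb_3d_finetune.yaml.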

Hello, after reading your reply, I have two questions.

The first question is: how do I set a correct ~/.tao_mounts.json file? I added the specs directory to the mounts, but the error still exists. My ~/.tao_mounts.json is generated as follows:

Mapping the local directories to the TAO docker:

import json
import os

mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
    "Mounts": [
        {
            "source": os.environ["HOST_DATA_DIR"],
            "destination": "/data"
        },
        {
            "source": os.environ["HOST_SPECS_DIR"],
            "destination": "/specs"
        },
        {
            "source": os.environ["HOST_RESULTS_DIR"],
            "destination": "/results"
        },
        {
            "source": os.path.expanduser("~/.cache"),
            "destination": "/root/.cache"
        },
        {
            "source": os.environ["SPECS_DIR"],
            "destination": "/specs"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

The second question is: how do I use "action_recognition train xxx"?

When I run directly, the error is reported as follows:

root@c67f146de16e:/home/tlt-experiments/action_recognition_net# action_recognition train -e $SPECS_DIR/train_rgb_3d_finetune.yaml -r $RESULTS_DIR/rgb_3d_ptm -k $KEY model_config.rgb_pretrained_model_path=$RESULTS_DIR/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5

bash: action_recognition: command not found

This is a supplement to the second question:

root@c67f146de16e:/home/tlt-experiments/action_recognition_net# whereis action_recognition
action_recognition:

How do I add the action_recognition command?

See below info.

$ tao info --verbose
Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.21.11-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. emotionnet
                                6. efficientdet
                                7. fpenet
                                8. gazenet
                                9. gesturenet
                                10. heartratenet
                                11. lprnet
                                12. mask_rcnn
                                13. multitask_classification
                                14. retinanet
                                15. ssd
                                16. unet
                                17. yolo_v3
                                18. yolo_v4
                                19. yolo_v4_tiny
                                20. converter
                v3.21.11-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
                                2. faster_rcnn
        nvidia/tao/tao-toolkit-pyt:
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. text_classification
                                4. question_answering
                                5. token_classification
                                6. intent_slot_classification
                                7. punctuation_and_capitalization
                                8. spectro_gen
                                9. vocoder
                                10. action_recognition
        nvidia/tao/tao-toolkit-lm:
                v3.21.08-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

The action_recognition network is in nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3

So, you need to trigger docker nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3.

That is why the user guide officially recommends using the tao-launcher to trigger the TAO docker instead of "docker run xxx".

For ~/.tao_mounts.json, please refer to TAO Toolkit Launcher — TAO Toolkit 3.22.02 documentation

Hello,

For ~/.tao_mounts.json, I referred to the TAO Toolkit Launcher — TAO Toolkit 3.22.02 documentation. However, I found that the local machine path cannot be found in the docker container; it only works when I use the in-container path as the local machine path. I want to know how to configure the paths.

The second problem is this error when running outside the container:

root@ubuntu-MS-7B94:/media/userdata/fanyl/tlt/tlt-experiments/action_recognition_net1# tao action_recognition train -e /media/userdata/fanyl/tlt/tlt-experiments/action_recognition_net1/specs/train_rgb_3d_finetune.yaml -r /media/userdata/fanyl/tlt/tlt-experiments/action_recognition_net1/results/rgb_3d_ptm -k Zm9xNHQ0ajE1YjI5aGJiNzU4OTZtcDhxdDY6YjhhMTE4NGEtYWJmNi00MGU0LWIxNjAtNmYyNjg2N2JlYjUy model_config.rgb_pretrained_model_path=/media/userdata/fanyl/tlt/tlt-experiments/action_recognition_net1/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5

~/.tao_mounts.json wasn't found. Falling back to obtain mount points and docker configs from ~/.tlt_mounts.json.
Please note that this will be deprecated going forward.

2022-05-21 20:56:42,644 [INFO] root: Registry: ['nvcr.io']
2022-05-21 20:56:42,704 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
2022-05-21 20:56:42,825 [INFO] root: No mount points were found in the /root/.tlt_mounts.json file.
2022-05-21 20:56:42,825 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tlt_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
ERROR: The indicated experiment spec file /media/userdata/fanyl/tlt/tlt-experiments/action_recognition_net1/specs/train_rgb_3d_finetune.yaml doesn't exist!

I want to know how to configure ~/.tao_mounts.json outside the container.

You can create ~/.tao_mounts.json,
then follow the TAO user guide to set up the correct mapping.

Following your suggestion, I solved the above problem. However, an error is still reported when running again. The details are as follows:

ubuntu@ubuntu-MS-7B94:/media/userdata/fanyl/cv_samples_vv1.3.0/action_recognition_net1$ tao action_recognition train -e /specs/train_rgb_3d_finetune.yaml -r /results/rgb_3d_ptm -k Zm9xNHQ0ajE1YjI5aGJiNzU4OTZtcDhxdDY6YjhhMTE4NGEtYWJmNi00MGU0LWIxNjAtNmYyNjg2N2JlYjUy model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5
2022-05-22 18:42:36,345 [INFO] root: Registry: ['nvcr.io']
2022-05-22 18:42:36,409 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
2022-05-22 18:42:36,524 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py:76: UserWarning:
'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See Automatic schema-matching | Hydra for migration instructions.
loading trained weights from /results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/results/rgb_3d_ptm', 'encryption_key=Zm9xNHQ0ajE1YjI5aGJiNzU4OTZtcDhxdDY6YjhhMTE4NGEtYWJmNi00MGU0LWIxNjAtNmYyNjg2N2JlYjUy', 'model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 70, in main
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 22, in run_experiment
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 29, in init
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 36, in _build_model
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/build_nn_model.py", line 76, in build_ar_model
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/ar_model.py", line 88, in get_basemodel3d
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/ar_model.py", line 23, in load_pretrained_weights
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/utils/common_utils.py", line 22, in patch_decrypt_checkpoint
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/checkpoint_encryption.py", line 26, in decrypt_checkpoint
_pickle.UnpicklingError: invalid load key, '\xf6'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 76, in
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/hydra_runner.py", line 99, in wrapper
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
2022-05-22 18:42:42,903 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
ubuntu@ubuntu-MS-7B94:/media/userdata/fanyl/cv_samples_vv1.3.0/action_recognition_net1$

Please tell me what I should do.

This kind of error is usually due to a wrong mapping setting.

Please check ~/.tao_mounts.json.

Please note that all paths in the command line should be paths inside the docker.
The paths are defined in ~/.tao_mounts.json.

Or you can log in to the docker directly and run tasks:
$ tao action_recognition
then,
# action_recognition train xxx
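The host-to-docker path translation the mounts file describes can be illustrated with a small sketch (to_container_path is a made-up helper, not part of TAO, and the mount entries are hypothetical):

```python
# Example Mounts entries as they might appear in ~/.tao_mounts.json.
mounts = [
    {"source": "/media/userdata/fanyl/tlt/specs", "destination": "/specs"},
    {"source": "/media/userdata/fanyl/tlt/results", "destination": "/results"},
]

def to_container_path(host_path, mounts):
    """Hypothetical helper: return the path the docker sees for a host path."""
    for m in mounts:
        if host_path.startswith(m["source"]):
            # Replace the mounted host prefix with its in-docker destination.
            return m["destination"] + host_path[len(m["source"]):]
    raise ValueError(f"{host_path} is not under any mounted source directory")

# The -e argument must use the translated (in-docker) form:
print(to_container_path("/media/userdata/fanyl/tlt/specs/train_rgb_3d_finetune.yaml", mounts))
# -> /specs/train_rgb_3d_finetune.yaml
```

A host path that is not listed under any "source" simply does not exist inside the docker, which is consistent with the "spec file doesn't exist" errors above.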

After reconfiguring the mapping file, I reran the command and the following error is reported:

ubuntu@ubuntu-MS-7B94:~$ tao action_recognition train -e /workspace/tlt-experiments/like/specs/train_rgb_3d_finetune.yaml -r /workspace/tlt-experiments/results/rgb_3d_ptm -k $KEY model_config.rgb_pretrained_model_path=/workspace/tlt-experiments/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5
2022-05-25 17:31:18,152 [INFO] root: Registry: ['nvcr.io']
2022-05-25 17:31:18,226 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
2022-05-25 17:31:18,345 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
mismatched input '=' expecting
See basic | Hydra for details

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2022-05-25 17:31:23,170 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What is the reason?
Here is my Tao information:

tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

I tried again and still got an error. If there are relevant examples, please send them to me.

print("Train RGB only model with PTM")
!tao action_recognition train \
    -e $SPECS_DIR/train_rgb_3d_finetune.yaml \
    -r $RESULTS_DIR/rgb_3d_ptm \
    -k $KEY \
    model_config.rgb_pretrained_model_path=$RESULTS_DIR/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt \
    model_config.rgb_pretrained_num_classes=5

Train RGB only model with PTM
2022-05-28 21:46:46,098 [INFO] root: Registry: ['nvcr.io']
2022-05-28 21:46:46,174 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
2022-05-28 21:46:46,214 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py:76: UserWarning:
'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See Automatic schema-matching | Hydra for migration instructions.
loading trained weights from /home/action_recognition_net1/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/home/action_recognition_net1/results/rgb_3d_ptm', 'encryption_key=Zm9xNHQ0ajE1YjI5aGJiNzU4OTZtcDhxdDY6YjhhMTE4NGEtYWJmNi00MGU0LWIxNjAtNmYyNjg2N2JlYjUy', 'model_config.rgb_pretrained_model_path=/home/action_recognition_net1/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 70, in main
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 22, in run_experiment
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 29, in init
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 36, in _build_model
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/build_nn_model.py", line 76, in build_ar_model
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/ar_model.py", line 88, in get_basemodel3d
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/ar_model.py", line 23, in load_pretrained_weights
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/utils/common_utils.py", line 22, in patch_decrypt_checkpoint
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/checkpoint_encryption.py", line 26, in decrypt_checkpoint
_pickle.UnpicklingError: invalid load key, '\xf6'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 76, in
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/hydra_runner.py", line 99, in wrapper
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
2022-05-28 21:46:57,602 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I logged in to the docker directly and ran the task.
The command I executed:

ubuntu@ubuntu-MS-7B94:~$ tao action_recognition

root@ab0479f7c057:/workspace/tlt/samples# action_recognition train -e /home/action_recognition_net1/specs/train_rgb_3d_finetune.yaml -r /home/action_recognition_net1/results/rgb_3d_ptm -k Zm9xNHQ0ajE1YjI5aGJiNzU4OTZtcDhxdDY6YjhhMTE4NGEtYWJmNi00MGU0LWIxNjAtNmYyNjg2N2JlYjUymodel_config.rgb_pretrained_model_path=/home/action_recognition_net1/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5

Here I encountered a new error, which is roughly as follows:

mismatched input '=' expecting
See basic | Hydra for details

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I didn't find ~/.tao_mounts.json in the container. Is it a mapping problem?

I am afraid so. Please try running from a terminal instead of the notebook.
For debugging, you can log in to the docker and run the command:
$ tao action_recognition
then inside the docker
# action_recognition train xxx

Firstly, please create ~/.tao_mounts.json and set it correctly.
Then please try running from a terminal instead of the notebook.
For debugging, you can log in to the docker and run the command:
$ tao action_recognition
then inside the docker
# action_recognition train xxx