TLT V2.0 Classification

DavidWWalker · December 11, 2020, 3:31am

TLT Classification example use case

Step 3. Run TLT training

ERROR:
!tlt-train classification -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

Using TensorFlow backend.
2020-12-11 03:08:25.123672: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'

Using default-> classification_spec.cfg

The notebook fails because magnet_train.py cannot find the module third_party. Please help resolve this.

Morganh · December 11, 2020, 6:53am

Can you double-check your setup?
I cannot reproduce your issue, and no other users have reported a similar problem.

DavidWWalker · December 11, 2020, 6:08pm

I successfully completed the TLT YOLO example use case. However, the Classification and DetectNet use cases both fail in the training step. If you cannot reproduce this error on your end and no one else has reported the problem, how can I go about troubleshooting my docker setup?

What is causing magnet_train.py to fail? Where is the module third_party? How do I resolve it, or how do I avoid it?

I’ve reproduced the problem on my end numerous times, shutting down the docker container each time. A week ago the tlt-train yolo use case completed successfully.

Do I have a version problem? I installed the docker container 3 weeks ago.

!tlt-train classification -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

Using TensorFlow backend.
2020-12-11 17:54:32.348641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'

!tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY -n resnet18_detector --gpus $NUM_GPUS

Using TensorFlow backend.
2020-12-11 17:46:55.067487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'
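
One way to approach the question of where third_party should come from is to ask the interpreter inside the container directly. The following is a minimal diagnostic sketch, not part of TLT: the helper name locate_module is hypothetical, and the module names it checks are simply the ones that appear in the tracebacks in this thread. It uses only the standard library and does not import the packages it looks up.

```python
import importlib.util
import sys


def locate_module(name):
    """Return where a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages may have no single origin; fall back to their search locations.
    return spec.origin or list(spec.submodule_search_locations or [])


if __name__ == "__main__":
    # Module names taken from the tracebacks in this thread.
    for name in ("third_party", "iva", "keras"):
        print(f"{name}: {locate_module(name) or 'NOT FOUND'}")
    print("interpreter:", sys.executable)
    print("sys.path:")
    for entry in sys.path:
        print("   ", entry)
```

Running it once from a notebook cell and once from the container's command line, then comparing the two outputs, may show whether the two environments use different interpreters or search paths.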

DavidWWalker · December 11, 2020, 6:12pm

Problem report for tlt-export - October 3rd.

Adding sudo did not resolve the problem.

Morganh · December 12, 2020, 2:13am

I suggest narrowing it down as follows:

double check the Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation
how did you trigger the docker? Is it Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation
Can you run again directly in the docker with command line instead of running it in jupyter notebook?

DavidWWalker · December 15, 2020, 4:51pm

tlt-train ran to completion from the command line. Running it from the Jupyter notebook failed with the latest versions of both Chrome and Firefox. The host runs Ubuntu 18.04.

Morganh · December 16, 2020, 3:05am

Can you save your Jupyter notebook as an HTML file and attach it here?
Also, could you try another experiment: create a new cell in the Jupyter notebook, paste in the command that ran successfully outside the notebook, and run it in that cell. Please use explicit arguments without any $… variables.
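
The request for explicit arguments is a way to rule out variables that are unset or different inside the notebook. As a rough sketch of the same check, the snippet below expands the $VAR references in a command string and lists any that are not defined. The helper name expand_command is hypothetical, and the sketch assumes the notebook defines these values as environment variables; it runs nothing and only prints text. Note that the expanded command will show the value of $KEY, so avoid pasting that output into a public post.

```python
import os
import re


def expand_command(command, env=None):
    """Expand $VAR and ${VAR} references; return (expanded_command, missing_names)."""
    env = os.environ if env is None else env
    missing = []

    def substitute(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            missing.append(name)
            return match.group(0)  # leave unresolved references visible
        return env[name]

    expanded = re.sub(r"\$(?:\{(\w+)\}|(\w+))", substitute, command)
    return expanded, missing


if __name__ == "__main__":
    # The training command from the first post in this thread.
    cmd = ("tlt-train classification -e $SPECS_DIR/classification_spec.cfg "
           "-r $USER_EXPERIMENT_DIR/output -k $KEY")
    expanded, missing = expand_command(cmd)
    print(expanded)
    print("unset variables:", missing)
```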

swillis · December 19, 2020, 3:14am

Following the “mask-rcnn tutorial” at this link https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Ran docker container per the instructions at the above link

sudo docker run --runtime=nvidia -it -v /media/techgarage/dldata/tlt-experiments:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

I was able to download the COCO data set using the Python file provided in the container and convert it per the instructions.

I made sure the directory structure and paths in the Mask R-CNN spec file matched my setup, which the instructions did not mention. I am running on only 1 GPU, so I reduced init_learning_rate to 1/8 of its value, to 0.00125, per the instructions. I did not change learning_rate_steps, although what I have read about linear learning rate scaling suggests that learning_rate_steps should also be scaled. Any additional guidance on what else should change in the spec file when using only 1 GPU would be appreciated.
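
For reference, the arithmetic behind the linear scaling idea mentioned above can be sketched as follows. This is only an illustration of the rule of thumb, not confirmed guidance for the Mask R-CNN spec file: the function name scale_for_gpus is hypothetical, the 8-GPU baseline and the base rate of 0.01 are inferred from the 1/8 reduction to 0.00125, and the step boundaries in the example are made-up placeholder values, not the ones from the tutorial's spec.

```python
def scale_for_gpus(base_lr, base_steps, base_gpus, target_gpus):
    """Linear scaling rule of thumb.

    The learning rate scales with the GPU count (and so with the global batch
    size), while step boundaries scale inversely so that each learning rate
    drop happens after roughly the same number of images has been seen.
    """
    ratio = target_gpus / base_gpus
    scaled_lr = base_lr * ratio
    scaled_steps = [round(step / ratio) for step in base_steps]
    return scaled_lr, scaled_steps


if __name__ == "__main__":
    # Placeholder step boundaries, NOT taken from the tutorial's spec file.
    lr, steps = scale_for_gpus(
        base_lr=0.01, base_steps=[10000, 15000, 20000], base_gpus=8, target_gpus=1
    )
    print("init_learning_rate:", lr)
    print("learning_rate_steps:", steps)
```

Under the same assumption, the total number of training steps would grow by the same factor as the step boundaries. Whether the toolkit expects this adjustment is the open question in this post.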

I ran the following and got the same error reported by the others in this thread. Since this is being run from the docker image provided by NVIDIA, I am not sure this type of error is something we can resolve ourselves, particularly because ModuleNotFoundError: No module named 'third_party' is not very descriptive.

Any suggestions?

root@40a95783d406:/workspace/examples/maskrcnn# sudo tlt-train mask_rcnn -e /workspace/tlt-experiments/maskrcnn/specs/maskrcnn_train_resnet50.txt -r /workspace/tlt-experiments/maskrcnn/exp1/ -k $KEY --gpus 1
Using TensorFlow backend.
2020-12-19 02:49:34.549658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'

Morganh · December 19, 2020, 3:33am

@swillis
Please do not use “sudo” and retry.

swillis · December 19, 2020, 4:52am

I tried that and got an error that I think was related to not being able to access the spec file. I found a discussion about adding the user to a group to avoid running as sudo, but I didn’t spend time tracking down the permission issues.

The error is the same as the other two, so unless they were also running with sudo, I am not sure it is related.

swillis · December 19, 2020, 2:35pm

The sudo requirement appears to be related to the docker login step that sets the auth key. Without sudo, it doesn’t work. The example provided does not include sudo, and I spent more than an hour trying to figure out why my copy and paste of the long key wasn’t working. I repeated the same steps with sudo and it worked. I have done the install a couple of times, and sudo appears to be required to set the login credentials.

Execute docker login nvcr.io from the command line and enter these login credentials:

Username: “$oauthtoken”
Password: “YOUR_NGC_API_KEY”

If I don’t use sudo, I get an argument error, which I assume is asking for -d/--model_dir. That argument isn’t in the original tutorial for the example, so I assume the error results from the encrypted model not being loadable for some reason.

Morganh · December 19, 2020, 2:40pm

Hi @swillis,
Can you follow Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation?
Actually, “sudo” is not needed when starting the TLT docker container.
Can you attach the logs from the error you get without sudo?

Morganh · December 19, 2020, 2:44pm

Also, for docker, please check whether Docker Pull Permission Denied Issue-Can't Download Docker Container - #2 by Morganh can help you.

swillis · December 19, 2020, 3:24pm

Yes, I followed that guide originally; step 6, setting up the key, required sudo to work.

swillis · December 19, 2020, 3:30pm

I did the group setup and still get the same permission denied error without sudo. I made the first two attempts to rule out a copy-and-paste error in the key used as the password.

The third attempt, with sudo, worked. I took the URL info out of the pasted output because it is flagged as spam when posting.

I am attaching a screenshot because the forum won’t allow me to post what appear to be links.

swillis · December 19, 2020, 3:37pm

Running as the normal logged-in user, without su and without sudo, docker is still stuck on the permission denied problem.

Morganh · December 19, 2020, 3:43pm

Can you try
$ docker run hello-world

I think you will get the same error result.
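
If plain docker commands fail with permission denied, a common cause is that the user's group membership has not taken effect yet: a newly added group typically applies only after logging out and back in. The sketch below is a hypothetical helper, not part of docker or TLT, that reports both conditions. It assumes a Linux host where the daemon socket is owned by a group named docker, which is the usual default but may differ on a given system.

```python
import getpass
import grp
import os


def docker_group_status(user=None, group="docker"):
    """Report whether `user` is listed in `group` and whether this process has that group."""
    user = user or getpass.getuser()
    try:
        entry = grp.getgrnam(group)
    except KeyError:
        # The group does not exist on this system at all.
        return {"group_exists": False, "listed_in_group": False, "active_in_session": False}
    return {
        "group_exists": True,
        # Listed in the group database (e.g. /etc/group) as a supplementary member.
        "listed_in_group": user in entry.gr_mem,
        # Group actually held by the current process; False usually means a new login is needed.
        "active_in_session": entry.gr_gid in os.getgroups(),
    }
```

Calling docker_group_status() with no arguments checks the current user. A result where listed_in_group is True but active_in_session is False would suggest that the group setup itself is fine and only a fresh login session is missing.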

emkikuchi21 · December 22, 2020, 12:00am

Hello.
I’m getting a similar error.

root@e389f322067a:/workspace# cd tlt-experiments
root@e389f322067a:/workspace/tlt-experiments# sudo tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-21 23:58:56.176449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 5, in <module>
    from iva.common.export.app import main
  File "/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 13, in <module>
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 28, in <module>
    import third_party.keras.mixed_precision
ModuleNotFoundError: No module named 'third_party'

I installed it by following these two guides.
[Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation]
[Installation Guide — NVIDIA Cloud Native Technologies documentation]
What should I do?

Morganh · December 22, 2020, 1:32am

Please do not use “sudo” and retry.

emkikuchi21 · December 22, 2020, 4:47am

Thank you for your reply.

root@22097c5203ec:/workspace/tlt-experiments# tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-22 04:45:41.741864: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 5, in <module>
    from iva.common.export.app import main
ImportError: cannot import name 'main'
