TLT V2.0 Classification

Can you double-check?
I cannot reproduce your issue, and no similar topic has been raised by other end users.

I successfully completed the TLT Yolo example use case. However, the Classification and Detectnet use cases both fail in the training step. If you cannot reproduce this error on your end and no one else has reported the problem, how can I go about troubleshooting my docker setup?

What is causing magnet_train.py to fail? Where is the module third_party? How do I resolve it, or how do I avoid it?

I’ve reproduced the problem at my end numerous times, shutting down the docker container each time. A week ago the tlt-train yolo use case completed successfully.

Do I have a version problem? I installed the docker container 3 weeks ago.

!tlt-train classification -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

Using TensorFlow backend.
2020-12-11 17:54:32.348641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'

!tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
-n resnet18_detector \
--gpus $NUM_GPUS

Using TensorFlow backend.
2020-12-11 17:46:55.067487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'
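
One way to narrow this down from inside the container is to ask Python where, if anywhere, it can find the modules named in the traceback. The sketch below is a generic diagnostic and not an NVIDIA-provided tool; the module names are taken from the tracebacks in this thread, and the helper name locate_module is made up for illustration.

```python
import importlib.util


def locate_module(name):
    """Return where a top-level module would be loaded from, or None if it is not importable."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    # Regular modules have an origin file; namespace packages only have search locations.
    return spec.origin or list(spec.submodule_search_locations or [])


if __name__ == "__main__":
    # Names taken from the tracebacks above; find_spec does not import them.
    for name in ("third_party", "iva", "keras"):
        print(name, "->", locate_module(name))
```

If third_party prints None while iva resolves, the package is missing from that Python environment rather than being shadowed by something on the path.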

Problem report for tlt-export - October 3rd.

Adding sudo did not resolve the problem.

I suggest narrowing it down as follows:

  1. Double-check the software requirements: https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/requirements_and_installation.html#software-requirements
  2. How did you start the docker container? Did you follow https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/requirements_and_installation.html#running-the-transfer-learning-toolkit
  3. Can you run the command again directly in the docker container from the command line, instead of in the Jupyter notebook?

tlt-train executed to completion from the command line. Running it from the Jupyter notebook failed with the latest versions of both Chrome and Firefox. Ubuntu 18.04.


Can you save your Jupyter notebook as an HTML file and attach it here?
Also, could you try another experiment: create a new cell in the Jupyter notebook, paste the command that ran successfully outside the notebook, and run it in that cell? Please use explicit arguments without any $…
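
To build that explicit command, it can help to see what each $ variable actually expands to inside the notebook kernel. This is a minimal sketch, assuming the variables were set in the kernel's environment (for example with %env or os.environ); explicit_command is a hypothetical helper, and the command string is the classification one from the first post.

```python
import os
import re


def explicit_command(template, env=None):
    """Expand $VAR and ${VAR} references; return the expanded string and any unset names."""
    env = os.environ if env is None else env
    missing = []

    def substitute(match):
        name = match.group(1) or match.group(2)
        if name in env:
            return env[name]
        missing.append(name)
        return match.group(0)  # leave unset references visible in the output

    expanded = re.sub(r"\$(?:\{(\w+)\}|(\w+))", substitute, template)
    return expanded, missing


if __name__ == "__main__":
    cmd, unset = explicit_command(
        "tlt-train classification -e $SPECS_DIR/classification_spec.cfg "
        "-r $USER_EXPERIMENT_DIR/output -k $KEY"
    )
    print(cmd)  # note: this prints the value of KEY, so do not share the output as-is
    print("unset variables:", unset)
```

Any name reported as unset would reach tlt-train as an empty or literal argument, which is one way a command can work in a terminal and fail in a notebook cell.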

Following the “mask-rcnn tutorial” at this link https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/

I am running Ubuntu 18.04. I worked very hard to install CUDA 10.1, but something NVIDIA-related in the install updated it to CUDA 11.0:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 28%   33C    P8     7W / 180W |    497MiB /  8116MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   33C    P8     5W / 180W |      7MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Ran docker container per the instructions at the above link

sudo docker run --runtime=nvidia -it -v /media/techgarage/dldata/tlt-experiments:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

Was able to download the COCO dataset using the Python file provided in the container and convert it per the instructions.

Made sure the directory structure and paths in the Mask R-CNN spec file matched my setup, which the instructions do not mention. I am only running on 1 GPU, so I reduced init_learning_rate to 1/8 of its value, 0.00125, per the instructions. I did not change learning_rate_steps, although what I have read about linear learning rate scaling suggests that learning_rate_steps should also be scaled. Any additional guidance on what else should change in the spec file when using only 1 GPU would be appreciated.
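
For reference, the arithmetic behind that question can be sketched as follows. This assumes the common linear scaling rule with an unchanged per-GPU batch size: the learning rate shrinks with the GPU count and the step boundaries grow by the same factor. Whether TLT expects learning_rate_steps to be adjusted this way is not confirmed anywhere in this thread. The 8-GPU baseline and the 0.01 base rate are inferred from the 1/8 reduction to 0.00125, the step values are placeholders, and scale_schedule is a hypothetical helper.

```python
def scale_schedule(base_lr, base_steps, base_gpus, new_gpus):
    """Apply the linear scaling rule when the GPU count (and so the global batch) changes."""
    if base_gpus <= 0 or new_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    ratio = new_gpus / base_gpus
    scaled_lr = base_lr * ratio
    # Fewer GPUs means fewer samples per iteration, so each boundary moves later.
    scaled_steps = [round(step / ratio) for step in base_steps]
    return scaled_lr, scaled_steps


if __name__ == "__main__":
    # Placeholder step boundaries, not values read from the actual spec file.
    lr, steps = scale_schedule(0.01, [10000, 15000, 20000], base_gpus=8, new_gpus=1)
    print("init_learning_rate:", lr)
    print("learning_rate_steps:", steps)
```

Under the same assumption, the total number of training steps would be scaled by the same factor as the step boundaries.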

Ran the following and got the same error reported by others in this thread. Since this is run from the docker image provided by NVIDIA, I am not sure this type of error is something we can resolve ourselves, particularly as the message is not very descriptive: ModuleNotFoundError: No module named 'third_party'

Any suggestions?

root@40a95783d406:/workspace/examples/maskrcnn# sudo tlt-train mask_rcnn -e /workspace/tlt-experiments/maskrcnn/specs/maskrcnn_train_resnet50.txt -r /workspace/tlt-experiments/maskrcnn/exp1/ -k $KEY --gpus 1
Using TensorFlow backend.
2020-12-19 02:49:34.549658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 5, in <module>
    from iva.common.magnet_train import main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 13, in <module>
ModuleNotFoundError: No module named 'third_party'

@swillis
Please do not use “sudo” and retry.

Tried that and got an error that I think was related to not being able to access the spec file. I found a discussion about adding the user to a group to avoid running as sudo, but didn’t spend time tracking down permission issues.

The error is the same as the other two, so unless they were also running with sudo, I’m not sure it is related.

The sudo requirement appears to be related to the docker login step that sets the auth key. Without sudo it doesn’t work. The example provided does not include sudo, and I spent more than an hour trying to figure out why my copy and paste of a long key wasn’t working. I repeated the same steps with sudo and it worked. I have done the install a couple of times, and sudo appears to be required to set the login credentials.

Execute docker login nvcr.io from the command line and enter these login credentials:

  1. Username: “$oauthtoken”
  2. Password: “YOUR_NGC_API_KEY”

If I don’t use sudo, I get an argument error, which I assume is asking for -d/--model_dir. That isn’t in the original tutorial for the example, so I assume it is a symptom of the encrypted model not being loadable for some reason.

Hi @swillis,
Can you follow https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/requirements_and_installation.html#software-requirements?
Actually, “sudo” is not needed when starting the tlt docker.
Can you attach some logs from when you hit the error without sudo?

Also, for docker, please check whether the topic “Docker Pull Permission Denied Issue-Can't Download Docker Container” helps.

Yes, I followed that guide originally; step 6, setting up the key, required sudo to work.

I did the group setup and still get the same permission denied error without sudo. I made the first two attempts to make sure I hadn’t made a copy-and-paste error with the key as the password.

The third attempt, with sudo, worked. I took the URL info out of the copy and paste because it gets flagged as spam when posting.

Posting a screenshot, as the forum won’t allow me to post what appear to be links.

Running as the normal logged-in user, without su and without sudo, docker is still stuck on the permission denied problem.

Can you try
$ docker run hello-world

I think you will get the same error result.
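
If that also fails with permission denied, a quick check of the docker socket and group membership may show why. The sketch below assumes the Linux defaults of /var/run/docker.sock for the socket and docker for the group; docker_access_report is a hypothetical helper and not part of Docker or TLT.

```python
import grp
import os


def docker_access_report(sock_path="/var/run/docker.sock", group_name="docker"):
    """Report whether the current process can reach the Docker socket without sudo."""
    report = {
        "socket_exists": os.path.exists(sock_path),
        "socket_rw": os.access(sock_path, os.R_OK | os.W_OK),
    }
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        report["group_exists"] = False
        report["in_group_now"] = False
    else:
        report["group_exists"] = True
        # Group IDs of this running process; a newly added membership
        # only appears after logging out and back in.
        report["in_group_now"] = gid in os.getgroups() or gid == os.getegid()
    return report


if __name__ == "__main__":
    for key, value in docker_access_report().items():
        print(f"{key}: {value}")
```

If group_exists is True but in_group_now is False right after the group setup, the current login session has most likely not picked up the new membership yet.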

Hello.
I’m getting a similar error.

root@e389f322067a:/workspace# cd tlt-experiments
root@e389f322067a:/workspace/tlt-experiments# sudo tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-21 23:58:56.176449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 5, in <module>
    from iva.common.export.app import main
  File "/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 13, in <module>
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 28, in <module>
    import third_party.keras.mixed_precision
ModuleNotFoundError: No module named 'third_party'

I installed it by following these two guides:
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/requirements_and_installation.html#installation-prerequisites
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
What should I do?

Please do not use “sudo” and retry.

Thank you for your reply

root@22097c5203ec:/workspace/tlt-experiments# tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-22 04:45:41.741864: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 5, in <module>
    from iva.common.export.app import main
ImportError: cannot import name 'main'

I resolved it myself.
I changed from yuw-v2, which was updated 3 days ago, to v2.0_py3, and it worked.
Thank you.