Using TensorFlow backend.
2020-12-11 03:08:25.123672: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 5, in
from iva.common.magnet_train import main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 13, in
ModuleNotFoundError: No module named ‘third_party’
I successfully completed the TLT Yolo example use case. However, the Classification and Detectnet use cases both fail in the training step. If you cannot reproduce this error on your end and no one else has reported the problem yet, how can I go about troubleshooting my Docker setup?
What is causing magnet_train.py to fail? Where is the module third_party? How do I resolve it, or how do I avoid it?
I’ve reproduced the problem on my end numerous times. I’ve shut down the Docker container each time. A week ago the tlt-train Yolo use case completed successfully.
Do I have a version problem? I installed the docker container 3 weeks ago.
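One way to start answering "where is the module third_party" is to ask the interpreter inside the container where it would load each module from. The sketch below is only a diagnostic, not a fix. It assumes it is run with the same Python interpreter that tlt-train uses, and the helper names `locate_module` and `report` are made up for illustration.

```python
import importlib.util
import os
import sys


def locate_module(name):
    """Return where a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin; show their search locations instead.
    if spec.origin is None and spec.submodule_search_locations:
        return ", ".join(spec.submodule_search_locations)
    return spec.origin


def report(names=("third_party", "iva", "keras")):
    """Print the module search path and the resolved location of each module."""
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<unset>"))
    for entry in sys.path:
        print("  sys.path entry:", entry)
    for name in names:
        print(f"{name}: {locate_module(name) or 'NOT FOUND'}")


if __name__ == "__main__":
    report()
```

If third_party shows as NOT FOUND while iva and keras resolve, the package is either missing from the image or sits in a directory that is not on the search path for that session. Comparing the printed path against a session where training works may narrow it down.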
Using TensorFlow backend.
2020-12-11 17:54:32.348641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 5, in
from iva.common.magnet_train import main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 13, in
ModuleNotFoundError: No module named ‘third_party’
Using TensorFlow backend.
2020-12-11 17:46:55.067487: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 5, in
from iva.common.magnet_train import main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 13, in
ModuleNotFoundError: No module named ‘third_party’
Can you save your Jupyter notebook as an HTML file and attach it here?
Also, could you try another experiment: create a new cell in the Jupyter notebook, paste the command that ran successfully outside the notebook, and run it in that cell? Please use explicit arguments without any $…
I am running on Ubuntu 18.04. I worked very hard to install CUDA 10.1, but something NVIDIA-related in the install updated it to CUDA 11.0.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 28%   33C    P8     7W / 180W |    497MiB /  8116MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   33C    P8     5W / 180W |      7MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Ran docker container per the instructions at the above link
Was able to download coco data set using python file provided in container and convert per instructions
Made sure the directory structure and paths in the spec file for maskrcnn were aligned with my setup, which was not indicated in the instructions. I am only running on 1 GPU, so I reduced init_learning_rate by a factor of 8 to 0.00125 per the instructions. I did not change learning_rate_steps, although what I have read about linear learning rate scaling suggests that learning_rate_steps should also be scaled. Any additional guidance on what else should change in the spec file when using only 1 GPU would be appreciated.
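For reference, the linear scaling arithmetic mentioned above can be sketched as follows. This assumes the per-GPU batch size stays the same, so going from 8 GPUs to 1 shrinks the global batch by 8 and each step boundary has to move out by 8 to cover the same number of training samples. The function name `scale_for_gpus` and the step values are hypothetical and are not taken from the maskrcnn spec file; only the 0.00125 figure comes from the post.

```python
def scale_for_gpus(base_lr, base_gpus, target_gpus, base_steps=None):
    """Apply the linear scaling rule when the GPU count changes.

    The learning rate scales with the GPU count (that is, with the global
    batch size). Step boundaries scale inversely, so each boundary is reached
    after the same number of training samples as before.
    """
    ratio = target_gpus / base_gpus
    scaled_lr = base_lr * ratio
    scaled_steps = None
    if base_steps is not None:
        scaled_steps = [int(round(step / ratio)) for step in base_steps]
    return scaled_lr, scaled_steps


if __name__ == "__main__":
    # Hypothetical step boundaries, used only to show the arithmetic.
    lr, steps = scale_for_gpus(0.01, base_gpus=8, target_gpus=1,
                               base_steps=[10000, 15000, 20000])
    print("init_learning_rate:", lr)
    print("learning_rate_steps:", steps)
```

Whether the TLT spec expects the steps to be rescaled this way is not confirmed anywhere in this thread, so treat the output as a starting point for experiments rather than recommended values.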
Ran the following and got the same error reported by the others on this thread. Since this is being run from the Docker image provided by NVIDIA, I am not sure this type of error is something we can resolve ourselves, particularly as ModuleNotFoundError: No module named ‘third_party’ is not very descriptive.
Any suggestions?
root@40a95783d406:/workspace/examples/maskrcnn# sudo tlt-train mask_rcnn -e /workspace/tlt-experiments/maskrcnn/specs/maskrcnn_train_resnet50.txt -r /workspace/tlt-experiments/maskrcnn/exp1/ -k $KEY --gpus 1
Using TensorFlow backend.
2020-12-19 02:49:34.549658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 5, in
from iva.common.magnet_train import main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 13, in
ModuleNotFoundError: No module named ‘third_party’
I tried that and got an error that I think was related to not being able to access the spec file. I found a discussion about adding the user to a group to avoid running with sudo, but didn’t spend time trying to track down permission issues.
The error is the same as the other two, so unless they were also running with sudo, I am not sure it is related.
The sudo requirement appears to be related to the docker login step that sets the auth key. Without sudo it doesn’t work. The example provided does not include sudo, and I spent more than an hour trying to figure out why my copy and paste of a long key wasn’t working. I repeated the same steps with sudo and it worked. I have done the install a couple of times, and sudo appears to be required to set the login credentials.
Execute docker login nvcr.io from the command line and enter these login credentials:
Username: “$oauthtoken”
Password: “YOUR_NGC_API_KEY”
If I don’t use sudo, I get an argument error, which I assume is asking for a -d/--model_dir argument. That argument isn’t in the original tutorial for the example, so I assume it is part of the encrypted model not being loadable for some reason.
I did the group setup and still get the same permission denied error if I don’t use sudo. The first two attempts were to make sure I hadn’t made a copy-and-paste error when entering the key as the password.
The third attempt, with sudo, worked. I took the URL info out of the copy and paste because it triggers a spam flag when posting.
Screen shot as the forum won’t allow me to post what appear to be links
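Since the permission denied error persisted after the group setup, it may be worth checking whether the current session has actually picked up the group. A membership change normally only takes effect in a new login session. The sketch below is run on the host, not in the container. It assumes the group is named docker, and the function name `docker_group_status` is made up for illustration.

```python
import grp
import os
import pwd


def docker_group_status(group_name="docker"):
    """Check group membership as recorded on disk and as seen by this process."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        # The group does not exist on this host at all.
        return {"exists": False, "listed": False, "active": False}

    try:
        entry = pwd.getpwuid(os.getuid())
    except KeyError:
        entry = None  # uid has no passwd entry (can happen inside containers)

    # "listed": the group database names this user as a member.
    listed = entry is not None and (
        entry.pw_name in group.gr_mem or entry.pw_gid == group.gr_gid
    )
    # "active": the running process actually carries the group id.
    active = group.gr_gid in os.getgroups() or os.getegid() == group.gr_gid
    return {"exists": True, "listed": listed, "active": active}


if __name__ == "__main__":
    print(docker_group_status())
```

If the result shows listed as True but active as False, the shell predates the group change, and logging out and back in would usually be the next thing to try. This only covers group membership; it does not rule out other causes of the permission error.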
root@e389f322067a:/workspace# cd tlt-experiments
root@e389f322067a:/workspace/tlt-experiments# sudo tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-21 23:58:56.176449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
File "/usr/local/bin/tlt-export", line 5, in <module>
from iva.common.export.app import main
File "/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 13, in <module>
File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 28, in <module>
import third_party.keras.mixed_precision
ModuleNotFoundError: No module named 'third_party'
root@22097c5203ec:/workspace/tlt-experiments# tlt-export detectnet_v2 -m resnet34_peoplenet.tlt -k tlt_encode
Using TensorFlow backend.
2020-12-22 04:45:41.741864: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
File "/usr/local/bin/tlt-export", line 5, in <module>
from iva.common.export.app import main
ImportError: cannot import name 'main'