TAO crash after driver update

Hello,

My problem is a continuation of CLI update - #12 by Morganh.

When I start training, I get a crash at a random epoch (I tried twice).
!tao faster_rcnn train --gpu_index $GPU_INDEX -e $SPECS_DIR/default_spec_resnet18.txt

Epoch 218/300
48/52 [==========================>…] - ETA: 2s - loss: 0.5245 - rpn_out_class_loss: 0.0188 - rpn_out_regress_loss: 0.0085 - dense_class_td_loss: 0.0536 - dense_regress_td_loss: 0.0534
2022-06-23 16:41:22.433076: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[72cd7c1307dc:00054] *** Process received signal ***
[72cd7c1307dc:00054] Signal: Aborted (6)
[72cd7c1307dc:00054] Signal code: (-6)
[72cd7c1307dc:00054] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f3f4842c210]
[72cd7c1307dc:00054] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3f4842c18b]
[72cd7c1307dc:00054] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3f4840b859]
[72cd7c1307dc:00054] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xc1b1788)[0x7f3eec1ba788]
[72cd7c1307dc:00054] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x235cb2a)[0x7f3ee2365b2a]
[72cd7c1307dc:00054] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xbb)[0x7f3ee9d612db]
[72cd7c1307dc:00054] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f3edf41ce6d]
[72cd7c1307dc:00054] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f3edf41997c]
[72cd7c1307dc:00054] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f3f4774fde4]
[72cd7c1307dc:00054] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f3f483cc609]
[72cd7c1307dc:00054] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3f48508293]
[72cd7c1307dc:00054] *** End of error message ***
2022-06-23 18:41:23,295 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    14W / 170W |    223MiB / 12288MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       918      G   /usr/lib/xorg/Xorg                 94MiB |
|    0   N/A  N/A      1227      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      3579      G   /usr/lib/firefox/firefox          102MiB |
+-----------------------------------------------------------------------------+

tao info --verbose

Configuration of the TAO Toolkit Instance
dockers:
    nvidia/tao/tao-toolkit-tf:
        v3.22.05-tf1.15.5-py3:
            docker_registry: nvcr.io
            tasks:
                1. augment
                2. bpnet
                3. classification
                4. dssd
                5. faster_rcnn
                6. emotionnet
                7. efficientdet
                8. fpenet
                9. gazenet
                10. gesturenet
                11. heartratenet
                12. lprnet
                13. mask_rcnn
                14. multitask_classification
                15. retinanet
                16. ssd
                17. unet
                18. yolo_v3
                19. yolo_v4
                20. yolo_v4_tiny
                21. converter
        v3.22.05-tf1.15.4-py3:
            docker_registry: nvcr.io
            tasks:
                1. detectnet_v2
    nvidia/tao/tao-toolkit-pyt:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. speech_to_text
                2. speech_to_text_citrinet
                3. speech_to_text_conformer
                4. action_recognition
                5. pointpillars
                6. pose_classification
                7. spectro_gen
                8. vocoder
        v3.21.11-py3:
            docker_registry: nvcr.io
            tasks:
                1. text_classification
                2. question_answering
                3. token_classification
                4. intent_slot_classification
                5. punctuation_and_capitalization
    nvidia/tao/tao-toolkit-lm:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

Do you have any idea what causes this, and what the error code "6" refers to?

Thank you for your help.
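
On the "6" in the trace: the line "Signal: Aborted (6)" reports a POSIX signal number, not a TAO-specific error code. The backtrace also shows an abort() frame in libc right after the TensorFlow log line marked "F" (fatal), which is consistent with the process aborting itself. As a minimal sketch, assuming a Linux machine with Python 3, the number can be decoded with the standard signal module (describe_signal is a hypothetical helper name used only for illustration):

```python
import signal


def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    # signal.Signals is an enum of the signals known to the running OS.
    return signal.Signals(number).name


if __name__ == "__main__":
    # On Linux, signal 6 is the abort signal raised by abort().
    print(describe_signal(6))
```

This only identifies how the process ended; it does not say why the GPU event failed in the first place.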

For faster_rcnn, please switch to the v3.22.05-tf1.15.4-py3 docker.
You can run something similar to the following.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash

To run the notebook, I do the following

export PATH=$PATH:/home/pryntec/.local/bin
export WORKON_HOME=~/Envs
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source $HOME/.local/bin/virtualenvwrapper.sh
workon launcher

Then,

pip show nvidia-tao

gives the new version

Name: nvidia-tao
Version: 0.1.24
Summary: NVIDIA’s Launcher for TAO Toolkit.
Home-page:
Author: Varun Praveen
Author-email: vpraveen@nvidia.com
License: NVIDIA Proprietary Software
Location: /home/pryntec/Envs/launcher/lib/python3.8/site-packages
Requires: docker, six, tabulate
Required-by:

and I have the following docker images available

REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.5-py3 b85103564252 5 weeks ago 11.7GB
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.4-py3 ca92a571a959 5 weeks ago 16.1GB
nvcr.io/nvidia/tao/tao-toolkit-tf v3.21.11-tf1.15.4-py3 fadbda32c62f 7 months ago 16.1GB
hello-world latest feb5d9fea6a5 9 months ago 13.3kB
nvidia/cuda 11.0-base 2ec708416bb8 22 months ago 122MB

However, when I run the docker image that you suggest,

docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash

I get the following error at the end of the extraction process

chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory

which is supposed to be solved by the nvidia-tao update. I need to run

docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash

to avoid the error. Is that normal?

Moreover, in the notebook, how can I check which docker image is currently running (to confirm that tf1.15.4 is running and not tf1.15.5)?

Thanks
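
One possible way to see which image is in use, sketched under assumptions: the launcher log above ends with "Stopping container", so the container appears to exist only while a tao command runs, and the check has to be made on the host during training. The sketch relies on the standard `docker ps --format "{{.Image}}"` command; the helper names and the "tao-toolkit" filter string are illustrative choices, not part of the TAO launcher.

```python
import subprocess


def filter_tao_images(docker_ps_output, needle="tao-toolkit"):
    """Keep only the image names that contain `needle` (an assumed filter)."""
    lines = [line.strip() for line in docker_ps_output.splitlines()]
    return [line for line in lines if line and needle in line]


def running_tao_images():
    """Ask the local Docker daemon for the images of running containers."""
    # `docker ps --format "{{.Image}}"` prints one image name per running container.
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Image}}"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return filter_tao_images(result.stdout)

# Usage on the host, while a training run is in progress:
#     print(running_tao_images())
```

The tag at the end of each returned name (for example v3.22.05-tf1.15.4-py3) shows which TensorFlow build the container was started from.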

Yes, you can add --entrypoint "" to log in to the docker.

To check that it is tf1.15.4, run the following inside the docker:

$ python
then
>>> import tensorflow as tf
>>> tf.__version__
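
The same check can be written as a small script, which may be handier than typing at the interpreter prompt. This is a sketch only: the function names are hypothetical, the expected string "1.15.4" is taken from the image tag and is assumed (not confirmed) to match what the container's TensorFlow build reports, and the script has to be run inside the container rather than in the host notebook kernel.

```python
import importlib.util


def installed_tf_version():
    """Return tensorflow.__version__, or None if TensorFlow is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    return tf.__version__


def is_expected_release(version, expected_prefix="1.15.4"):
    """Prefix check, so a build suffix after the release number still matches."""
    return version is not None and version.startswith(expected_prefix)


if __name__ == "__main__":
    version = installed_tf_version()
    print("TensorFlow version:", version)
    print("Matches 1.15.4:", is_expected_release(version))
```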