TAO Toolkit crashes with a .so error

Error when executing CPU operator TFRecordReader encountered:
[/opt/dali/dali/operators/reader/loader/indexed_file_loader.h:77] Assert on "p != nullptr" failed: Error reading from a file /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00000-of-00010
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f15beb1dace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x2511c78) [0x7f15c0c1bc78]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x250a8c2) [0x7f15c0c148c2]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x250c05b) [0x7f15c0c1605b]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x2bb3def) [0x7f15c12bddef]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f1683763609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f168389f293]
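For what it's worth, this assert usually means DALI could not read one of the shard files. Below is a minimal sketch, not from the original post, to check whether every generated shard can be read back, assuming TF 1.15 is importable inside the TAO container; the glob pattern is my assumption, based on the path in the error above.

import glob
import tensorflow as tf

# Pattern is an assumption, taken from the shard path in the error message above.
pattern = "/workspace/tao-experiments/data/tfrecords/kitti_train*"
for shard in sorted(glob.glob(pattern)):
    count = 0
    try:
        # Iterate over raw records; a truncated or corrupt shard raises DataLossError.
        # tf_record_iterator is deprecated in TF 1.15 but still available.
        for _ in tf.compat.v1.io.tf_record_iterator(shard):
            count += 1
        print(f"{shard}: {count} records")
    except tf.errors.DataLossError as err:
        print(f"{shard}: unreadable ({err})")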

Hi,
Could you please share the full command and log from when you generated the tfrecord files?

command:

!tao faster_rcnn train --gpu_index $GPU_INDEX -e $SPECS_DIR/default_spec_resnet18_retrain_spec.txt

Issue output:

2022-09-01 05:49:16,081 [WARNING] iva.faster_rcnn.utils.utils: Got label marked as difficult(occlusion > 0), please set occlusion field in KITTI label to 0 and re-generate TFRecord dataset, if you want to include it in mAP calculation during validation/evaluation.
360/805 [============>…] - ETA: 5:46 - loss: 0.3276 - rpn_out_class_loss: 0.0077 - rpn_out_regress_loss: 0.0034 - dense_class_td_loss: 0.0441 - dense_regress_td_loss: 0.0528
2022-09-01 05:54:16.755915: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[fb3ba95d79fb:00058] *** Process received signal ***
[fb3ba95d79fb:00058] Signal: Aborted (6)
[fb3ba95d79fb:00058] Signal code: (-6)
[fb3ba95d79fb:00058] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1023fc0210]
[fb3ba95d79fb:00058] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1023fc018b]
[fb3ba95d79fb:00058] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1023f9f859]
[fb3ba95d79fb:00058] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xc1b1788)[0x7f0fc2d4c788]
[fb3ba95d79fb:00058] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x235cb2a)[0x7f0fb8ef7b2a]
[fb3ba95d79fb:00058] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xbb)[0x7f0fc08f32db]
[fb3ba95d79fb:00058] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f0fb5faee6d]
[fb3ba95d79fb:00058] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f0fb5fab97c]
[fb3ba95d79fb:00058] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f10232e3de4]
[fb3ba95d79fb:00058] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f1023f60609]
[fb3ba95d79fb:00058] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f102409c293]
[fb3ba95d79fb:00058] *** End of error message ***
2022-09-01 13:54:19,573 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Sorry,

The command used to generate the TFRecord files is:

!tao ssd dataset_convert \
    -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
    -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train
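As an aside (not part of the original reply): the training log above warns about labels with occlusion > 0 and suggests setting the field to 0 before regenerating the TFRecords. A hedged sketch to list such labels, assuming standard 15-column KITTI label files; the label directory path is hypothetical.

from pathlib import Path

label_dir = Path("/workspace/tao-experiments/data/training/label_2")  # hypothetical path
for label_file in sorted(label_dir.glob("*.txt")):
    for line_no, line in enumerate(label_file.read_text().splitlines(), start=1):
        fields = line.split()
        # In KITTI labels the 3rd column is the 'occluded' flag (0 = fully visible).
        if len(fields) >= 3 and float(fields[2]) > 0:
            print(f"{label_file.name}:{line_no}: occlusion={fields[2]}")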

This is a known intermittent issue for faster_rcnn.
For faster_rcnn, please use the 22.05-tf1.15.4 docker instead of the 22.05-tf1.15.5 docker.

Please open a terminal and run this command there:
$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash

Then, inside the docker, run the training command. NOTE: there is no need to use the "tao" prefix there.
# faster_rcnn train xxx

Similar topic: TAO crash after driver update - #3 by dbrazey

I have changed the docker environment and got a new issue.

The new issue:

2022-09-03 03:18:24,348 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2022-09-03 03:18:32,937 [INFO] root: Starting Training Loop.
Epoch 1/12
2022-09-03 11:21:06,575 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.


(full log attached as an image)

Can you run in the terminal instead of the notebook?

As mentioned above, please open a terminal and run this command there:
$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash

Then, inside the docker, run the training command. NOTE: there is no need to use the "tao" prefix there.
# faster_rcnn train xxx

Then please share the log.

Hi,

What about this issue?

root@9cb2235d0ec2:/workspace# classification train -e /xxx/classification_spec.cfg -r /xxx/output -k xxx

Using TensorFlow backend.
2022-09-08 10:09:53.613143: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-09-08 10:09:53,677 [WARNING] tensorflow: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/classification", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/entrypoint/makenet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 263, in launch_job
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/calibration_tensorfile.py", line 13, in <module>
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 28, in <module>
    import third_party.keras.mixed_precision
ModuleNotFoundError: No module named 'third_party'

I opened that __init__.py and found the import shown above, but third_party cannot be found in my environment.
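In case it helps, here is a small diagnostic sketch (nothing TAO-specific, only the standard library) to show where Python is searching for modules and whether a third_party package is visible at all:

import importlib.util
import sys

print("Python:", sys.version.split()[0])
for path_entry in sys.path:
    print("sys.path:", path_entry)

# find_spec returns None when the module cannot be located on sys.path.
spec = importlib.util.find_spec("third_party")
print("third_party spec:", spec)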

Hi,
So you are switching to classification, right?

For classification, I cannot reproduce the error.
See below log.

$ tao classification run /bin/bash
root@57f7497adc79:/workspace# python
>>> import third_party
>>>

Yes, we decided to use the simpler classification task.

Python 3.6.9 (default, Mar 15 2022, 13:55:28) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import third_party
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'third_party'
>>>

There is still no third_party module. Should I pip install it? But there is no package named third_party available:

root@9cb2235d0ec2:/workspace# pip3 install third_party
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement third_party (from versions: none)
ERROR: No matching distribution found for third_party

Please run the command below to enter the TAO docker. There is no need to install any third_party package.
$ tao classification run /bin/bash

Please share the full command and log with me. Thanks.

Before this command, I ran docker login nvcr.io to confirm my environment.
Now the environment is the v3.22.05-tf1.15.4-py3 docker.

And when I ran your command, I received this:

root@9cb2235d0ec2:/workspace# tao classification run /bin/bash
2022-09-09 02:39:43,069 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 984, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 984, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 297, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 152, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 62, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

Are you running on a dGPU device or a Jetson device?

The last few days it worked fine.

The device is an RTX 5000, and the driver version is 515.xx:

root@9cb2235d0ec2:/workspace# nvidia-smi -l
Fri Sep  9 02:56:07 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:03:00.0 Off |                  Off |
| 33%   28C    P8     1W / 230W |      5MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I see. According to the log above, I am afraid you are launching the TAO docker from inside another docker.

Can you exit this docker and run "$ tao classification run /bin/bash" again?
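To illustrate the point (this explanation is my assumption, not a statement from the thread): the tao launcher talks to the Docker daemon through /var/run/docker.sock, and that socket normally does not exist inside the TAO container, which matches the FileNotFoundError in the traceback above. A quick check, using the same docker Python package that appears in that traceback:

import os
import docker  # the Docker SDK for Python, as seen in the traceback above

sock = "/var/run/docker.sock"
print("docker socket present:", os.path.exists(sock))

try:
    # from_env raises DockerException when the daemon cannot be reached.
    client = docker.from_env()
    print("daemon reachable:", client.ping())
except docker.errors.DockerException as err:
    print("daemon not reachable:", err)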

Yes, thanks.
But the original issue was in the tf1.15.5 environment, which is why we moved to tf1.15.4.
How do I change the tao-toolkit environment to tf1.15.4?

(launcher) root@test-PowerEdge-R730:/home/test# tao classification run /bin/bash
2022-09-09 12:44:53,329 [INFO] root: Registry: ['nvcr.io']
2022-09-09 12:44:53,472 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-09-09 12:44:53,678 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

There is no need to change it.
You can find the info via "$ tao info --verbose".
For classification, by default, it uses the tf1.15.5 docker.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.