Error when executing CPU operator TFRecordReader encountered:
[/opt/dali/dali/operators/reader/loader/indexed_file_loader.h:77] Assert on “p != nullptr” failed: Error reading from a file /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00000-of-00010
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f15beb1dace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x2511c78) [0x7f15c0c1bc78]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x250a8c2) [0x7f15c0c148c2]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x250c05b) [0x7f15c0c1605b]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x2bb3def) [0x7f15c12bddef]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f1683763609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f168389f293]
Hi,
Could you please share the full command and log when you generate the tfrecord files?
command:
!tao faster_rcnn train --gpu_index $GPU_INDEX -e $SPECS_DIR/default_spec_resnet18_retrain_spec.txt
issue output:
2022-09-01 05:49:16,081 [WARNING] iva.faster_rcnn.utils.utils: Got label marked as difficult(occlusion > 0), please set occlusion field in KITTI label to 0 and re-generate TFRecord dataset, if you want to include it in mAP calculation during validation/evaluation.
360/805 [============>…] - ETA: 5:46 - loss: 0.3276 - rpn_out_class_loss: 0.0077 - rpn_out_regress_loss: 0.0034 - dense_class_td_loss: 0.0441 - dense_regress_td_loss: 0.0528
2022-09-01 05:54:16.755915: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[fb3ba95d79fb:00058] *** Process received signal ***
[fb3ba95d79fb:00058] Signal: Aborted (6)
[fb3ba95d79fb:00058] Signal code: (-6)
[fb3ba95d79fb:00058] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1023fc0210]
[fb3ba95d79fb:00058] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1023fc018b]
[fb3ba95d79fb:00058] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1023f9f859]
[fb3ba95d79fb:00058] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xc1b1788)[0x7f0fc2d4c788]
[fb3ba95d79fb:00058] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x235cb2a)[0x7f0fb8ef7b2a]
[fb3ba95d79fb:00058] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xbb)[0x7f0fc08f32db]
[fb3ba95d79fb:00058] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x28d)[0x7f0fb5faee6d]
[fb3ba95d79fb:00058] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x4c)[0x7f0fb5fab97c]
[fb3ba95d79fb:00058] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f10232e3de4]
[fb3ba95d79fb:00058] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f1023f60609]
[fb3ba95d79fb:00058] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f102409c293]
[fb3ba95d79fb:00058] *** End of error message ***
2022-09-01 13:54:19,573 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Sorry,
The command to generate the tfrecord files is:
!tao ssd dataset_convert \
    -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
    -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train
This is a known random issue for faster_rcnn.
For faster_rcnn, please use the 22.05-tf1.15.4 docker instead of the 22.05-tf1.15.5 docker.
Please open a terminal and run in the terminal.
Command:
$ docker run --runtime=nvidia -it --rm --entrypoint "" \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash
Then, inside the docker, run the training command. NOTE: there is no need to use the “tao” prefix now.
# faster_rcnn train xxx
Similar topic: TAO crash after driver update - #3 by dbrazey
I have changed the docker env and got a new issue.
The new issue:
…
2022-09-03 03:18:24,348 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
2022-09-03 03:18:32,937 [INFO] root: Starting Training Loop.
Epoch 1/12
2022-09-03 11:21:06,575 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Can you run in the terminal instead of the notebook?
As mentioned above, please open a terminal and run in the terminal.
Command:
$ docker run --runtime=nvidia -it --rm --entrypoint "" \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 /bin/bash
Then, inside the docker, run the training command. NOTE: there is no need to use the “tao” prefix now.
# faster_rcnn train xxx
Then please share the log.
Hi,
How about this issue?
root@9cb2235d0ec2:/workspace# classification train -e /xxx/classification_spec.cfg -r /xxx/output -k xxx
Using TensorFlow backend.
2022-09-08 10:09:53.613143: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-09-08 10:09:53,677 [WARNING] tensorflow: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/local/bin/classification", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/entrypoint/makenet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 263, in launch_job
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/calibration_tensorfile.py", line 13, in <module>
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 28, in <module>
    import third_party.keras.mixed_precision
ModuleNotFoundError: No module named 'third_party'
Hi,
So, you are changing to classification, right?
For classification, I cannot reproduce the error.
See below log.
$ tao classification run /bin/bash
root@57f7497adc79:/workspace# python
>>> import third_party
>>>
Yes, we decided to use simple classification.
Python 3.6.9 (default, Mar 15 2022, 13:55:28)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import third_party
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'third_party'
>>>
There is still no third_party module. Should I pip install it? But there is no package named third_party.
root@9cb2235d0ec2:/workspace# pip3 install third_party
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement third_party (from versions: none)
ERROR: No matching distribution found for third_party
Please run the below command to enter the tao docker. It is not needed to install any third_party package.
$ tao classification run /bin/bash
Please share the full command and log with me. Thanks.
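As a quick sanity check, you can test whether the interpreter you are in can see a given module. This is a generic sketch (not part of the TAO tooling); as the session above shows, third_party is importable inside the correct TAO container but is not a PyPI distribution:

```python
import importlib.util

def module_available(name):
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# Inside the correct TAO container this should print True for "third_party";
# on a plain host it will print False, and pip cannot install it.
print(module_available("third_party"))
```

Running this inside the container entered via `tao classification run /bin/bash` should report the module as available.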
Before this command, I ran docker login nvcr.io
to confirm my env.
Now the env is docker v3.22.05-tf1.15.4-py3.
And I ran your command and received this:
root@9cb2235d0ec2:/workspace# tao classification run /bin/bash
2022-09-09 02:39:43,069 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 984, in send
self.connect()
File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 450, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 532, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 984, in send
self.connect()
File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/tao", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 115, in main
args[1:]
File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 297, in launch_command
docker_handler = self.handler_map[
File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 152, in handler_map
docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 62, in __init__
self._docker_client = docker.from_env()
File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
timeout=timeout, version=version, **kwargs_from_env(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
self.api = APIClient(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
self._version = self._retrieve_server_version()
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
Are you running on a dGPU device or a Jetson device?
It worked fine for the last few days.
The device is an RTX 5000, and the driver version is 515.xx.
root@9cb2235d0ec2:/workspace# nvidia-smi -l
Fri Sep 9 02:56:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 5000 Off | 00000000:03:00.0 Off | Off |
| 33% 28C P8 1W / 230W | 5MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I see. According to the above log, I am afraid you are triggering the tao docker from inside another docker.
Can you exit this docker and run “$ tao classification run /bin/bash” again?
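For reference, the FileNotFoundError in the traceback above comes from the launcher's docker.from_env() call trying to reach the Docker daemon through its default Unix socket, which does not exist inside a plain container. A minimal illustrative check (the socket path is Docker's default; the function name is mine, not part of the TAO launcher):

```python
import os

# docker.from_env() ultimately connects to the Docker daemon over this
# Unix socket; FileNotFoundError(2) means the socket file does not exist.
DOCKER_SOCK = "/var/run/docker.sock"

def docker_socket_present(path=DOCKER_SOCK):
    """Return True if the Docker daemon socket file exists at `path`."""
    return os.path.exists(path)

if docker_socket_present():
    print("Docker socket found - `tao` can launch containers from here.")
else:
    print("No Docker socket - run `tao` on the host, not inside a container.")
```

This is why the fix is simply to run the `tao` command on the host (or in an environment where the daemon socket is reachable) rather than inside another docker.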
Yes, thanks.
But the original issue was with the tf1.15.5 env, so the env should transfer to tf1.15.4.
How to change the tao-toolkit env to tf1.15.4?
(launcher) root@test-PowerEdge-R730:/home/test# tao classification run /bin/bash
2022-09-09 12:44:53,329 [INFO] root: Registry: ['nvcr.io']
2022-09-09 12:44:53,472 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-09-09 12:44:53,678 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
There is no update from you for a period, assuming this is not an issue anymore.
Hence we are closing this topic. If need further support, please open a new one.
Thanks
It is not needed to change.
You can find the info via “$ tao info --verbose”.
For classification, by default, it is using the tf1.15.5 docker.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.