Please provide the following information when requesting support.
• Hardware (RTX2060–>wsl2–>ubuntu20.04)
• Network Type (Classification)
• TLT Version (toolkit_version: 3.21.08; docker(NCCL version 2.7.8+cuda11.1))
365/365 [==============================] - 134s 367ms/step - loss: 2.1175 - acc: 0.4348 - val_loss: 1.6567 - val_acc: 0.5323
c3cd1da7de7d:96:132 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/IB : No device found.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
c3cd1da7de7d:96:132 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:469 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:660 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/topo.cc:523 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:581 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:840 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:876 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:887 → 2
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 494, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 77, in _average_metrics_in_place
self.backend.get_session().run(self.allreduce_ops[metric])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.
Original stack trace for ‘MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0’:
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 73, in _average_metrics_in_place
self._make_variable(metric, value)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 58, in _make_variable
allreduce_op = hvd.allreduce(var, device_dense=self.device)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/init.py”, line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py”, line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 80, in horovod_allreduce
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()
2021-11-18 10:46:25,079 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.