WSL2 & TAO issues


Please provide the following information when requesting support.

• Hardware (RTX 2060 -> WSL2 -> Ubuntu 20.04)
• Network Type (Classification)
• TLT Version (toolkit_version: 3.21.08; docker(NCCL version 2.7.8+cuda11.1))

365/365 [==============================] - 134s 367ms/step - loss: 2.1175 - acc: 0.4348 - val_loss: 1.6567 - val_acc: 0.5323
c3cd1da7de7d:96:132 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/IB : No device found.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

c3cd1da7de7d:96:132 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:469 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:660 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/topo.cc:523 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:581 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:840 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:876 -> 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 494, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 77, in _average_metrics_in_place
self.backend.get_session().run(self.allreduce_ops[metric])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0’:
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 73, in _average_metrics_in_place
self._make_variable(metric, value)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 58, in _make_variable
allreduce_op = hvd.allreduce(var, device_dense=self.device)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py”, line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py”, line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 80, in horovod_allreduce
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()

2021-11-18 10:46:25,079 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

TAO has not been tested or verified internally on WSL.
Based on earlier forum topics from end users, TLT 2.0 works on WSL, but TLT 3.0 and TAO 3.0 do not.
Could you pull the TLT 2.0 docker image and try it?
https://docs.nvidia.com/tao/

thanks so much

Why are libnccl2 and libnccl-dev not updated in the TAO images?

The newest versions do not exhibit this error, and manually updating the image with the newest versions makes it work without any further changes.
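
To confirm which NCCL build a run actually used before and after updating the packages, one option is to pull the version banner and any WARN lines out of the training log. Below is a minimal Python sketch of that check. The helper name `summarize_nccl_log` is made up for illustration, and the regular expressions assume the log format shown earlier in this thread; other NCCL builds may print these lines differently.

```python
import re

# Banner line printed by NCCL, e.g. "NCCL version 2.7.8+cuda11.1"
# (format assumed from the log in this thread).
VERSION_RE = re.compile(r"^NCCL version (\d+)\.(\d+)\.(\d+)\+cuda(\d+\.\d+)\s*$")
# Warning lines, e.g. "<host>:<pid>:<tid> [0] graph/xml.cc:332 NCCL WARN <message>".
WARN_RE = re.compile(r"NCCL WARN (.*)$")


def summarize_nccl_log(text):
    """Return (nccl_version, cuda_version, warnings) found in a log string.

    nccl_version is a tuple of ints such as (2, 7, 8), or None if no banner
    line is present. warnings is a list of the messages after "NCCL WARN".
    """
    version = None
    cuda = None
    warnings = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        banner = VERSION_RE.match(line)
        if banner:
            version = tuple(int(part) for part in banner.group(1, 2, 3))
            cuda = banner.group(4)
            continue
        warn = WARN_RE.search(line)
        if warn:
            warnings.append(warn.group(1))
    return version, cuda, warnings


if __name__ == "__main__":
    # Three lines copied from the log posted above.
    sample = (
        "c3cd1da7de7d:96:132 [0] NCCL INFO Using network Socket\n"
        "NCCL version 2.7.8+cuda11.1\n"
        "c3cd1da7de7d:96:132 [0] graph/xml.cc:332 NCCL WARN Could not find "
        "real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0\n"
    )
    print(summarize_nccl_log(sample))
```

Running it on a log captured after the update should show a different version tuple and, if the update helped, an empty warnings list.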

Thanks @joshH for the solution.
I will sync with the internal team on this.
