WSL2 & TAO issues


Please provide the following information when requesting support.

• Hardware (RTX 2060, WSL2, Ubuntu 20.04)
• Network Type (Classification)
• TLT Version (toolkit_version: 3.21.08; docker NCCL version 2.7.8+cuda11.1)

365/365 [==============================] - 134s 367ms/step - loss: 2.1175 - acc: 0.4348 - val_loss: 1.6567 - val_acc: 0.5323
c3cd1da7de7d:96:132 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/IB : No device found.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

c3cd1da7de7d:96:132 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:469 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:660 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/topo.cc:523 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:581 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:840 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:876 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:887 → 2
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 494, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 77, in _average_metrics_in_place
self.backend.get_session().run(self.allreduce_ops[metric])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0’:
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 73, in _average_metrics_in_place
self._make_variable(metric, value)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 58, in _make_variable
allreduce_op = hvd.allreduce(var, device_dense=self.device)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py”, line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py”, line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 80, in horovod_allreduce
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()

2021-11-18 10:46:25,079 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Actually, TAO is not tested or verified internally on WSL.
Based on previous forum topics from end users, TLT 2.0 works well, but TLT 3.0 / TAO 3.0 does not.
Could you pull the TLT 2.0 docker and try it?
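If it helps, pulling the TLT 2.0 container is a normal docker pull. The repository and tag below are only placeholders; take the exact ones from the TLT 2.0 page on NGC:
$ docker pull nvcr.io/nvidia/tlt-streamanalytics:&lt;TLT-2.0-tag&gt;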

thanks so much

Why are libnccl2 and libnccl-dev not updated in the TAO images?

The newest versions do not exhibit this error, and manually updating the image with the newest versions makes it work without any further changes.

Thanks @joshH for the solution.
I will sync with internal team for this.


Thanks @joshH for the hint again.
The latest TAO docker can also run in WSL once NCCL is updated.

Update NCCL via https://developer.nvidia.com/nccl/nccl-download
For example, to update to version 2.11.4:
sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0
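To confirm which NCCL version is actually installed afterwards, the standard apt/dpkg queries work (nothing TAO-specific):
$ apt-cache policy libnccl2
$ dpkg -l | grep libnccl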

thanks a lot

But how can I upgrade NCCL in the TAO docker?

@1442438890
Inside the docker, run the command below or a similar one.
# sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

The NCCL version is bound to the tao-docker image, and the tao command automatically runs the image as a temporary container every time. Can I find the Dockerfile that the tao command uses and modify it so that the NCCL upgrade commands run by default every time the container is built?

Currently, the docker image is downloaded when you run a tao command for the first time.
You can find the TAO docker image via “docker images”.
It should match one of the images listed when you run “tao info --verbose”.

Upgrading NCCL inside the docker is just a workaround for now. If you want a final docker image you can reuse every time, you can build a new docker based on the TAO docker, for example as sketched below.
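A minimal Dockerfile along these lines should work. The base image tag is an assumption here; replace it with the exact image listed by “tao info --verbose” or “docker images” on your system.

# Dockerfile (sketch)
# NOTE: the FROM tag below is assumed; use the one reported by “tao info --verbose”
FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
RUN apt-get update && \
    apt-get install -y --allow-downgrades \
        libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

Then build and run it yourself, e.g. “docker build -t tao-toolkit-tf:nccl-2.11.4 .”, instead of letting the tao launcher start the stock image.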

In a future release of TAO, we will update NCCL.

Thanks a lot.

Has this been fixed in the new release? I still have the same problem with NCCL.

@chongyeh91
No, the latest 3.21.11 does not include the update.
Please update NCCL inside the docker.

I am not sure how it can be done…

Do I run “docker images” or “docker container list” to view the docker ID? Then run “docker exec [DOCKER ID] apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0”?

When you trigger the tao docker, for example
$ tao ssd
then you can find the docker ID via
$ docker ps

Then,
$ docker exec -it <DOCKER ID> /bin/bash
# apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

Then NCCL will be installed in this docker (assuming the container is still running).
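As an optional extra step (plain Docker, not TAO-specific): because the tao launcher starts a fresh container each run, you can persist the NCCL fix by committing the patched container to a new image and running that image directly next time, e.g.
$ docker commit &lt;DOCKER ID&gt; tao-toolkit-tf:nccl-2.11.4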

@Morganh No Problem!
NGC shows an updated TAO image (updated 17.12); does this fix it? The version seems unchanged.

Could you share the exact link?

Of course: NGC link

I guess the NGC entry was edited, and not necessarily the image?

Thanks for the info. The dockers are still the ones that were released last month.