WSL2 & TAO issues


Please provide the following information when requesting support.

• Hardware (RTX 2060, WSL2, Ubuntu 20.04)
• Network Type (Classification)
• TLT Version (toolkit_version: 3.21.08; docker NCCL version 2.7.8+cuda11.1)

365/365 [==============================] - 134s 367ms/step - loss: 2.1175 - acc: 0.4348 - val_loss: 1.6567 - val_acc: 0.5323
c3cd1da7de7d:96:132 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/IB : No device found.
c3cd1da7de7d:96:132 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
c3cd1da7de7d:96:132 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

c3cd1da7de7d:96:132 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:469 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/xml.cc:660 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO graph/topo.cc:523 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:581 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:840 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:876 → 2
c3cd1da7de7d:96:132 [0] NCCL INFO init.cc:887 → 2
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 494, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 77, in _average_metrics_in_place
self.backend.get_session().run(self.allreduce_ops[metric])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0’:
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 500, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 495, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 468, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 73, in _average_metrics_in_place
self._make_variable(metric, value)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 58, in _make_variable
allreduce_op = hvd.allreduce(var, device_dense=self.device)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py”, line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py”, line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 80, in horovod_allreduce
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()

2021-11-18 10:46:25,079 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Actually, TAO is not tested or verified internally on WSL.
Based on previous forum topics from end users, TLT 2.0 works well, but TLT 3.0 / TAO 3.0 does not.
Could you pull the TLT 2.0 docker and try it?
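If it helps, pulling the TLT 2.0 container is a normal docker pull. The repository and tag below are only placeholders; take the exact ones from the TLT 2.0 page on NGC:
$ docker pull nvcr.io/nvidia/tlt-streamanalytics:&lt;TLT-2.0-tag&gt;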

thanks so much

Why are libnccl2 and libnccl-dev not updated in the TAO images?

The newest versions do not exhibit this error, and manually updating the image with the newest versions makes it work without any further changes.

Thanks @joshH for the solution.
I will sync with internal team for this.


Thanks @joshH for the hint again.
The latest TAO docker can also run in WSL once NCCL is updated.

Update NCCL via https://developer.nvidia.com/nccl/nccl-download
For example, to update to version 2.11.4:
sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0
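To confirm which NCCL version is actually installed afterwards, the standard apt/dpkg queries work (nothing TAO-specific):
$ apt-cache policy libnccl2
$ dpkg -l | grep libnccl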

thanks a lot

But how can I upgrade NCCL in the TAO docker?

@1442438890
Inside the docker, run the command below or a similar one.
# sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

The NCCL version is bound to the tao-docker image, and the tao command automatically runs the image as a temporary container every time. Can I find the Dockerfile that the tao command uses and modify it so that the NCCL upgrade commands run by default every time the container is built?

Currently, the docker image is downloaded when you run a tao command for the first time.
You can find the TAO docker image via “docker images”.
It should match one of the images listed when you run “tao info --verbose”.

Upgrading NCCL inside the docker is just a workaround for now. If you want a final docker image you can reuse every time, you can build a new docker based on the TAO docker, for example as sketched below.
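A minimal Dockerfile along these lines should work. The base image tag is an assumption here; replace it with the exact image listed by “tao info --verbose” or “docker images” on your system.

# Dockerfile (sketch)
# NOTE: the FROM tag below is assumed; use the one reported by “tao info --verbose”
FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
RUN apt-get update && \
    apt-get install -y --allow-downgrades \
        libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

Then build and run it yourself, e.g. “docker build -t tao-toolkit-tf:nccl-2.11.4 .”, instead of letting the tao launcher start the stock image.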

In a future release of TAO, we will update NCCL.

Thanks a lot.

Has this been fixed in the new release? I still have the same problem with NCCL.

@chongyeh91
No, the latest 3.21.11 does not include the update.
Please update NCCL inside the docker.

I am not sure how it can be done…

Do I run “docker images” or “docker container list” to view the docker ID? Then run “docker exec [DOCKER ID] apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0”?

When you trigger the tao docker, for example
$ tao ssd
then you can find the docker ID via
$ docker ps

Then,
$ docker exec -it <DOCKER ID> /bin/bash
# apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

Then NCCL will be installed in this docker (assuming the container is still running).
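As an optional extra step (plain Docker, not TAO-specific): because the tao launcher starts a fresh container each run, you can persist the NCCL fix by committing the patched container to a new image and running that image directly next time, e.g.
$ docker commit &lt;DOCKER ID&gt; tao-toolkit-tf:nccl-2.11.4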

@Morganh No Problem!
NGC shows an updated TAO image (updated 17.12); does this fix it? The version seems unchanged.

Could you share the exact link?

Of course: NGC link

I guess the NGC entry was edited, and not necessarily the image?

Thanks for the info. The dockers are still the ones that were released last month.