WSL2 & TAO issues


@1442438890
Inside the docker, run the command below or a similar one.
# sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

The version of NCCL is bound to the tao docker image, and the tao command runs the image as a temporary container every time. Can I find the Dockerfile that the tao command uses and modify it so that it runs the NCCL upgrade commands by default every time the container is built?

Currently, the docker image will be downloaded when you run a tao command for the first time.
You can find the tao docker image via “docker images”.
It should match one of the images listed when you run “tao info --verbose”.
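For example, something along these lines (the grep filter is just illustrative):

$ docker images | grep tao-toolkit
$ tao info --verbose

and compare the repository/tag shown by docker images against the image names and tags printed by tao info --verbose.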

Upgrading NCCL inside the docker is just a workaround as of now. If you want to load an already-fixed docker image every time, you can build a new docker image based on the tao docker.
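A minimal sketch of such a derived image, assuming the v3.21.11 TF image tag and the NCCL version from this thread (the file name and local tag are just examples, and the pinned NCCL version must still be available from the apt repositories inside the image):

$ cat > Dockerfile.tao-nccl <<'EOF'
# Hypothetical derived image: start from the TAO TF docker and upgrade NCCL
FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
RUN apt-mark unhold libnccl2 libnccl-dev && \
    apt-get update && \
    apt-get install -y --allow-change-held-packages \
        libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0
EOF
$ docker build -f Dockerfile.tao-nccl -t local/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 .

The tao launcher still has to be pointed at the rebuilt image instead of the one on nvcr.io; one way to do that (OVERRIDE_REGISTRY) is described further down in this thread.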

In a future release of tao, we’ll update NCCL.

Thanks a lot!

Has this been fixed in the new release? I still have the same problem with NCCL.

@chongyeh91
No, the latest 3.21.11 does not include the update.
Please update NCCL inside the docker.

I am not sure how it can be done…

Do I run “docker images” or “docker container list” to view the docker ID? Then run “docker exec [DOCKER ID] apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0”?

When you trigger the tao docker, for example
$ tao ssd
then you can find the docker ID via
$ docker ps

Then,
$ docker exec -it <DOCKER ID> /bin/bash
# apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0

Then NCCL will be installed in this docker (assuming the docker is still running).

@Morganh No Problem!
NGC shows an updated TAO image (updated 17.12); does this fix it? The version seems unchanged.

Could you share the exact link?

Of course: NGC link

I guess the NGC entry was edited, and not necessarily the image?

Thanks for the info. The dockers are still the ones that were released last month.

I just can’t seem to solve this. I figured out how to install the lib into the docker. But every time I try to run “tao detectnet_v2 train”, it starts another docker with an entirely different…

If you run inside the tao docker, please run “detectnet_v2 train” directly.

I actually used VSCode and its docker extension to do it.

  1. Run the Image in interactive mode (right click → run interactive)
  2. Run apt-mark unhold libnccl2 libnccl-dev
  3. Run apt update
  4. Run apt upgrade

You can then use docker ps and docker commit <container_id> local/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
in another terminal to generate your own image.
To get tao to use your image, create an environment variable (for example in jupyter):
%env OVERRIDE_REGISTRY=local

I remember having to remove a check in the tao sources so that it does not verify whether I am logged in to the docker repository.
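For anyone without VSCode, roughly the same workaround can be done from a plain shell; this is an untested sketch assuming the v3.21.11 TF image tag used above:

$ docker run -it --entrypoint /bin/bash nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
# apt-mark unhold libnccl2 libnccl-dev
# apt update && apt upgrade -y

Then, in a second terminal while that container is still running:

$ docker ps
$ docker commit <container_id> local/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
$ export OVERRIDE_REGISTRY=local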

Sadly, even with this change and using the workaround described here,
training stops after the first epoch with DSSD (I only tested this architecture), while evaluation is running. I attached the log.
The line NCCL WARN Cuda failure 'out of memory' does indicate some OOM issue, but I confirmed that I do not run out of memory (RAM or GPU).

Edit: RetinaNet is the same…

720/720 [==============================] - 174s 242ms/step - loss: 40.4018
f14eea1a9107:58:92 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
f14eea1a9107:58:92 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
f14eea1a9107:58:92 [0] NCCL INFO NET/IB : No device found.
f14eea1a9107:58:92 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
f14eea1a9107:58:92 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.5
f14eea1a9107:58:92 [0] NCCL INFO Channel 00/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 01/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 02/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 03/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 04/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 05/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 06/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 07/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 08/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 09/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 10/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 11/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 12/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 13/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 14/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 15/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 16/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 17/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 18/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 19/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 20/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 21/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 22/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 23/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 24/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 25/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 26/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 27/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 28/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 29/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 30/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Channel 31/32 :    0
f14eea1a9107:58:92 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1

f14eea1a9107:58:92 [0] include/alloc.h:20 NCCL WARN Cuda failure 'out of memory'
f14eea1a9107:58:92 [0] NCCL INFO channel.cc:34 -> 1
f14eea1a9107:58:92 [0] NCCL INFO init.cc:397 -> 1
f14eea1a9107:58:92 [0] NCCL INFO init.cc:800 -> 1
f14eea1a9107:58:92 [0] NCCL INFO init.cc:941 -> 1
f14eea1a9107:58:92 [0] NCCL INFO init.cc:977 -> 1
f14eea1a9107:58:92 [0] NCCL INFO init.cc:990 -> 1
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (5.120413). Check your callbacks.
  % delta_t_median)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0}}]]
	 [[MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0/_21209]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 366, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 528, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 516, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 362, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 274, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 217, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 92, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 85, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0/_21209]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_loss_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 366, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 516, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 362, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 274, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 217, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 92, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 81, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 66, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce
    ignore_name_scope=ignore_name_scope)
  File "<string>", line 102, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

@joshH
Can you check nvidia-smi and kill any jobs that are consuming GPU memory?
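For example (generic commands, nothing TAO-specific):

$ nvidia-smi
$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
$ kill <PID>

The Processes table at the bottom of the plain nvidia-smi output lists the PIDs holding GPU memory; note that under WSL2 this list may be empty even while memory is in use.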

The machine has one 3090 and one GeForce GT 1030, which is only used for display.
I start my training with --gpus 1 --gpu_index 0 and have CUDA_VISIBLE_DEVICES=0 set, though.
There are no other programs running, and the batch size is 1, just to be sure.
The job is running in WSL2.

Here is the output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1 just as the error happens:

11022 MiB, 24576 MiB
11022 MiB, 24576 MiB
11034 MiB, 24576 MiB
11078 MiB, 24576 MiB
11014 MiB, 24576 MiB
10220 MiB, 24576 MiB
703 MiB, 24576 MiB
703 MiB, 24576 MiB
703 MiB, 24576 MiB
703 MiB, 24576 MiB
703 MiB, 24576 MiB

I found that ResNet50-DSSD does yield the out-of-memory error, but ResNet18-DSSD does not. I am unsure why that is, since nvidia-smi does not show maxed-out memory.

Please try to run another network as well, for example LPRNet.
