TLT 3.0 & WSL2 issues

I am hitting the issue below when running the TLT 3.0 sample classification notebook under WSL2 (Ubuntu 20.04).
The same notebook runs without problems on standalone Ubuntu 20.04.

Can someone take a look at this?

• Hardware (PC : AMD RYZEN 3900X, 64 GB RAM, RTX 3090, Windows 10 INSIDER PREVIEW with WSL2, wsl2 kernel version 5.10.43.3-microsoft-standard-WSL2 , GPU driver 471.11, 470.76)
• Network Type (Classification)
• TLT Version (v3.0-py3)
• Training spec file(I have used the classification example Jupyter Notebook downloaded following this guide TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation (nvidia.com))
• How to reproduce the issue? Run the sample Jupyter notebook for classification under WSL2 Ubuntu 20.04 with the latest Docker and NVIDIA container libraries installed.
This produces the following output:

(2021-07-08 07:47:05,713 [INFO] root: Registry: ['nvcr.io']
2021-07-08 07:47:05,756 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2021-07-08 07:47:05,756 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.

Repository name: nvcr.io/nvidia/tlt-streamanalytics
2021-07-08 08:18:09,152 [INFO] tlt.components.docker_handler.docker_handler: Container pull complete.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-my3ozdff because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

['model_config', 'train_config']
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:281: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:290: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

2021-07-08 02:48:16,849 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2021-07-08 02:48:16,849 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

Found 11667 images belonging to 20 classes.
2021-07-08 02:48:17,578 [INFO] main: Processing dataset (train): /workspace/tlt-experiments/data/split/train
Found 1670 images belonging to 20 classes.
2021-07-08 02:48:17,791 [INFO] main: Processing dataset (validation): /workspace/tlt-experiments/data/split/val
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-07-08 02:48:17,791 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-07-08 02:48:17,792 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-07-08 02:48:17,805 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-07-08 02:48:17,809 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2021-07-08 02:48:18,341 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-07-08 02:48:18,890 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-07-08 02:48:18,890 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-07-08 02:48:19,104 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.


Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 3, 224, 224)  0
conv1 (Conv2D)                  (None, 64, 112, 112) 9408        input_1[0][0]
bn_conv1 (BatchNormalization)   (None, 64, 112, 112) 256         conv1[0][0]
activation_1 (Activation)       (None, 64, 112, 112) 0           bn_conv1[0][0]
block_1a_conv_1 (Conv2D)        (None, 64, 56, 56)   36864       activation_1[0][0]
block_1a_bn_1 (BatchNormalizati (None, 64, 56, 56)   256         block_1a_conv_1[0][0]
block_1a_relu_1 (Activation)    (None, 64, 56, 56)   0           block_1a_bn_1[0][0]
block_1a_conv_2 (Conv2D)        (None, 64, 56, 56)   36864       block_1a_relu_1[0][0]
block_1a_conv_shortcut (Conv2D) (None, 64, 56, 56)   4096        activation_1[0][0]
block_1a_bn_2 (BatchNormalizati (None, 64, 56, 56)   256         block_1a_conv_2[0][0]
block_1a_bn_shortcut (BatchNorm (None, 64, 56, 56)   256         block_1a_conv_shortcut[0][0]
add_1 (Add)                     (None, 64, 56, 56)   0           block_1a_bn_2[0][0]
                                                                 block_1a_bn_shortcut[0][0]
block_1a_relu (Activation)      (None, 64, 56, 56)   0           add_1[0][0]
block_1b_conv_1 (Conv2D)        (None, 64, 56, 56)   36864       block_1a_relu[0][0]
block_1b_bn_1 (BatchNormalizati (None, 64, 56, 56)   256         block_1b_conv_1[0][0]
block_1b_relu_1 (Activation)    (None, 64, 56, 56)   0           block_1b_bn_1[0][0]
block_1b_conv_2 (Conv2D)        (None, 64, 56, 56)   36864       block_1b_relu_1[0][0]
block_1b_conv_shortcut (Conv2D) (None, 64, 56, 56)   4096        block_1a_relu[0][0]
block_1b_bn_2 (BatchNormalizati (None, 64, 56, 56)   256         block_1b_conv_2[0][0]
block_1b_bn_shortcut (BatchNorm (None, 64, 56, 56)   256         block_1b_conv_shortcut[0][0]
add_2 (Add)                     (None, 64, 56, 56)   0           block_1b_bn_2[0][0]
                                                                 block_1b_bn_shortcut[0][0]
block_1b_relu (Activation)      (None, 64, 56, 56)   0           add_2[0][0]
block_2a_conv_1 (Conv2D)        (None, 128, 28, 28)  73728       block_1b_relu[0][0]
block_2a_bn_1 (BatchNormalizati (None, 128, 28, 28)  512         block_2a_conv_1[0][0]
block_2a_relu_1 (Activation)    (None, 128, 28, 28)  0           block_2a_bn_1[0][0]
block_2a_conv_2 (Conv2D)        (None, 128, 28, 28)  147456      block_2a_relu_1[0][0]
block_2a_conv_shortcut (Conv2D) (None, 128, 28, 28)  8192        block_1b_relu[0][0]
block_2a_bn_2 (BatchNormalizati (None, 128, 28, 28)  512         block_2a_conv_2[0][0]
block_2a_bn_shortcut (BatchNorm (None, 128, 28, 28)  512         block_2a_conv_shortcut[0][0]
add_3 (Add)                     (None, 128, 28, 28)  0           block_2a_bn_2[0][0]
                                                                 block_2a_bn_shortcut[0][0]
block_2a_relu (Activation)      (None, 128, 28, 28)  0           add_3[0][0]
block_2b_conv_1 (Conv2D)        (None, 128, 28, 28)  147456      block_2a_relu[0][0]
block_2b_bn_1 (BatchNormalizati (None, 128, 28, 28)  512         block_2b_conv_1[0][0]
block_2b_relu_1 (Activation)    (None, 128, 28, 28)  0           block_2b_bn_1[0][0]
block_2b_conv_2 (Conv2D)        (None, 128, 28, 28)  147456      block_2b_relu_1[0][0]
block_2b_conv_shortcut (Conv2D) (None, 128, 28, 28)  16384       block_2a_relu[0][0]
block_2b_bn_2 (BatchNormalizati (None, 128, 28, 28)  512         block_2b_conv_2[0][0]
block_2b_bn_shortcut (BatchNorm (None, 128, 28, 28)  512         block_2b_conv_shortcut[0][0]
add_4 (Add)                     (None, 128, 28, 28)  0           block_2b_bn_2[0][0]
                                                                 block_2b_bn_shortcut[0][0]
block_2b_relu (Activation)      (None, 128, 28, 28)  0           add_4[0][0]
block_3a_conv_1 (Conv2D)        (None, 256, 14, 14)  294912      block_2b_relu[0][0]
block_3a_bn_1 (BatchNormalizati (None, 256, 14, 14)  1024        block_3a_conv_1[0][0]
block_3a_relu_1 (Activation)    (None, 256, 14, 14)  0           block_3a_bn_1[0][0]
block_3a_conv_2 (Conv2D)        (None, 256, 14, 14)  589824      block_3a_relu_1[0][0]
block_3a_conv_shortcut (Conv2D) (None, 256, 14, 14)  32768       block_2b_relu[0][0]
block_3a_bn_2 (BatchNormalizati (None, 256, 14, 14)  1024        block_3a_conv_2[0][0]
block_3a_bn_shortcut (BatchNorm (None, 256, 14, 14)  1024        block_3a_conv_shortcut[0][0]
add_5 (Add)                     (None, 256, 14, 14)  0           block_3a_bn_2[0][0]
                                                                 block_3a_bn_shortcut[0][0]
block_3a_relu (Activation)      (None, 256, 14, 14)  0           add_5[0][0]
block_3b_conv_1 (Conv2D)        (None, 256, 14, 14)  589824      block_3a_relu[0][0]
block_3b_bn_1 (BatchNormalizati (None, 256, 14, 14)  1024        block_3b_conv_1[0][0]
block_3b_relu_1 (Activation)    (None, 256, 14, 14)  0           block_3b_bn_1[0][0]
block_3b_conv_2 (Conv2D)        (None, 256, 14, 14)  589824      block_3b_relu_1[0][0]
block_3b_conv_shortcut (Conv2D) (None, 256, 14, 14)  65536       block_3a_relu[0][0]
block_3b_bn_2 (BatchNormalizati (None, 256, 14, 14)  1024        block_3b_conv_2[0][0]
block_3b_bn_shortcut (BatchNorm (None, 256, 14, 14)  1024        block_3b_conv_shortcut[0][0]
add_6 (Add)                     (None, 256, 14, 14)  0           block_3b_bn_2[0][0]
                                                                 block_3b_bn_shortcut[0][0]
block_3b_relu (Activation)      (None, 256, 14, 14)  0           add_6[0][0]
block_4a_conv_1 (Conv2D)        (None, 512, 14, 14)  1179648     block_3b_relu[0][0]
block_4a_bn_1 (BatchNormalizati (None, 512, 14, 14)  2048        block_4a_conv_1[0][0]
block_4a_relu_1 (Activation)    (None, 512, 14, 14)  0           block_4a_bn_1[0][0]
block_4a_conv_2 (Conv2D)        (None, 512, 14, 14)  2359296     block_4a_relu_1[0][0]
block_4a_conv_shortcut (Conv2D) (None, 512, 14, 14)  131072      block_3b_relu[0][0]
block_4a_bn_2 (BatchNormalizati (None, 512, 14, 14)  2048        block_4a_conv_2[0][0]
block_4a_bn_shortcut (BatchNorm (None, 512, 14, 14)  2048        block_4a_conv_shortcut[0][0]
add_7 (Add)                     (None, 512, 14, 14)  0           block_4a_bn_2[0][0]
                                                                 block_4a_bn_shortcut[0][0]
block_4a_relu (Activation)      (None, 512, 14, 14)  0           add_7[0][0]
block_4b_conv_1 (Conv2D)        (None, 512, 14, 14)  2359296     block_4a_relu[0][0]
block_4b_bn_1 (BatchNormalizati (None, 512, 14, 14)  2048        block_4b_conv_1[0][0]
block_4b_relu_1 (Activation)    (None, 512, 14, 14)  0           block_4b_bn_1[0][0]
block_4b_conv_2 (Conv2D)        (None, 512, 14, 14)  2359296     block_4b_relu_1[0][0]
block_4b_conv_shortcut (Conv2D) (None, 512, 14, 14)  262144      block_4a_relu[0][0]
block_4b_bn_2 (BatchNormalizati (None, 512, 14, 14)  2048        block_4b_conv_2[0][0]
block_4b_bn_shortcut (BatchNorm (None, 512, 14, 14)  2048        block_4b_conv_shortcut[0][0]
add_8 (Add)                     (None, 512, 14, 14)  0           block_4b_bn_2[0][0]
                                                                 block_4b_bn_shortcut[0][0]
block_4b_relu (Activation)      (None, 512, 14, 14)  0           add_8[0][0]
avg_pool (AveragePooling2D)     (None, 512, 1, 1)    0           block_4b_relu[0][0]
flatten (Flatten)               (None, 512)          0           avg_pool[0][0]
predictions (Dense)             (None, 20)           10260       flatten[0][0]
==================================================================================================
Total params: 11,552,724
Trainable params: 11,376,020
Non-trainable params: 176,704


WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

2021-07-08 02:48:29,028 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

2021-07-08 02:48:29,036 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-07-08 02:48:29,818 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-07-08 02:48:29,912 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:929: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

2021-07-08 02:48:30,520 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:929: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:931: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

2021-07-08 02:48:30,520 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:931: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/80
  1/183 [..............................] - ETA: 18:54 - loss: 3.8637 - acc: 0.0312WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-07-08 02:48:38,641 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/183 [..............................] - ETA: 11:06 - loss: 3.7738 - acc: 0.0391/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.524770). Check your callbacks.
  % delta_t_median)
183/183 [==============================] - 28s 154ms/step - loss: 2.1876 - acc: 0.4130 - val_loss: 1.6841 - val_acc: 0.5299
8bd5b9887ac4:129:181 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
8bd5b9887ac4:129:181 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
8bd5b9887ac4:129:181 [0] NCCL INFO NET/IB : No device found.
8bd5b9887ac4:129:181 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
8bd5b9887ac4:129:181 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

8bd5b9887ac4:129:181 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/../../0000:0a:00.0
8bd5b9887ac4:129:181 [0] NCCL INFO graph/xml.cc:469 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO graph/xml.cc:660 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO graph/topo.cc:523 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:581 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:840 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:876 -> 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
     [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
     [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
     [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 492, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 487, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 460, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 77, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
     [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
     [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0':
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 492, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 487, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 460, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 73, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 58, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2021-07-08 08:19:02,268 [INFO] tlt.components.docker_handler.docker_handler: Stopping container…)
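
Most of the output above is TensorFlow deprecation noise; the lines that point at the failure are the NCCL WARN line and the ncclCommInitRank errors. A minimal sketch of a filter that pulls those out of a saved log might look like the following. The patterns and the function name `triage` are illustrative assumptions, not part of TLT or NCCL.

```python
import re

# Lines treated as noise (assumption: deprecation notices are not relevant here).
NOISE = re.compile(r"is deprecated\. Please use")
# Lines treated as pointing at the failure itself.
SIGNAL = re.compile(r"NCCL WARN|ncclCommInitRank failed")


def triage(lines):
    """Return the stripped lines matching SIGNAL but not NOISE, in order, without repeats."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if not text or NOISE.search(text):
            continue
        if SIGNAL.search(text) and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


# A few lines copied from the log above.
sample = [
    "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: "
    "The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.",
    "8bd5b9887ac4:129:181 [0] graph/xml.cc:332 NCCL WARN Could not find real path of "
    "/sys/class/pci_bus/0000:0a/../../0000:0a:00.0",
    "  (0) Unknown: ncclCommInitRank failed: unhandled system error",
    "  (0) Unknown: ncclCommInitRank failed: unhandled system error",
    "0 successful operations.",
]

for hit in triage(sample):
    print(hit)
```

To use it on a real run, read the saved log with `open(path).readlines()` and pass the result to `triage`.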

TLT 3.0 has this issue under WSL.

As a workaround, please pull the TLT 2.0_py3 Docker image instead and run training with it. TLT 2.0 does not have this issue.

docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
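
After pulling the image, the container has to be started by hand. A rough sketch of how the `docker run` command could be assembled is below. The host directory is a hypothetical placeholder, the container mount point `/workspace/tlt-experiments` is taken from the paths in the log above, and the choice of `--gpus all` assumes a Docker version with NVIDIA container toolkit support; adjust all of these to your setup.

```python
import shlex

# Image tag from the workaround above.
IMAGE = "nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3"


def docker_run_command(host_dir, container_dir="/workspace/tlt-experiments"):
    """Build a `docker run` argument list for an interactive GPU container.

    host_dir is a hypothetical local experiments directory to mount.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                      # expose all GPUs to the container
        "-v", f"{host_dir}:{container_dir}",  # mount the experiments directory
        IMAGE, "/bin/bash",
    ]


cmd = docker_run_command("/home/user/tlt-experiments")
print(shlex.join(cmd))  # copy the printed command into a WSL2 shell
```

The script only prints the command, so nothing is started until you run it yourself.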

Do you have a link where this issue is being tracked, or an ETA for its resolution?

Similar issues are reported in the links below:

NCCL tests don't work on WSL · Issue #442 · NVIDIA/nccl · GitHub
NCCL failure : "unhandled system error" for 2 GPUs
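
The NCCL WARN line in the log reports that the sysfs path for the GPU's PCI device could not be resolved. A quick way to check this from inside the container is sketched below. It rebuilds the path exactly as it appears in the warning and asks the OS whether it resolves. The bus id `0000:0a:00.0` is copied from the log and will differ on other machines, and the helper names are made up for illustration.

```python
import os


def nccl_pci_path(bus_id):
    """Build the sysfs path shown in the NCCL warning for a PCI bus id like '0000:0a:00.0'."""
    bus_id = bus_id.lower()
    domain_bus = bus_id.rsplit(":", 1)[0]  # '0000:0a:00.0' -> '0000:0a'
    return f"/sys/class/pci_bus/{domain_bus}/../../{bus_id}"


def path_resolves(path):
    """True if the OS can resolve the path, including its '..' components."""
    return os.path.exists(path)


path = nccl_pci_path("0000:0a:00.0")
status = "resolves" if path_resolves(path) else "does not resolve"
print(f"{path} {status}")
```

If the path does not resolve under WSL2 but does on native Ubuntu, that is consistent with the warning in the log, although it does not by itself prove this is the only cause of the failure.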

This issue still exists with the latest TAO, with all software and dependencies upgraded:

Epoch 1/80
  1/238 [..............................] - ETA: 35:21 - loss: 3.4665 - acc: 0.0938WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-09-03 03:34:33,108 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/238 [..............................] - ETA: 19:56 - loss: 3.6723 - acc: 0.0625/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.548347). Check your callbacks.
  % delta_t_median)
238/238 [==============================] - 40s 169ms/step - loss: 2.1327 - acc: 0.4371 - val_loss: 1.5816 - val_acc: 0.5542
96d216ed9f8a:127:179 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
96d216ed9f8a:127:179 [0] NCCL INFO NET/IB : No device found.
96d216ed9f8a:127:179 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

96d216ed9f8a:127:179 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/../../0000:0a:00.0
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:469 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:660 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/topo.cc:523 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:581 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:840 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:876 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 77, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 73, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 58, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2021-09-03 09:05:06,420 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Thanks for the info. I will sync with the internal team.

The latest TAO docker can also run in WSL.

Update NCCL from https://developer.nvidia.com/nccl/nccl-download
For example, to update to version 2.11.4:
sudo apt install libnccl2=2.11.4-1+cuda11.0 libnccl-dev=2.11.4-1+cuda11.0
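
The install command above pins both NCCL packages to the same build string. As a rough sketch (not an official tool), the Python helper below assembles that command from its parts so that libnccl2 and libnccl-dev cannot drift apart. The function name `nccl_apt_command` and its parameters are hypothetical; the defaults are simply the values from the command above, and other version/CUDA combinations may not exist in the apt repository.

```python
def nccl_apt_command(nccl_version="2.11.4", revision="1", cuda_tag="cuda11.0"):
    """Build an apt command that pins both NCCL packages to one build.

    All three parameters are illustrative; the defaults mirror the
    command quoted in this thread.
    """
    # Debian-style version pin, e.g. "2.11.4-1+cuda11.0"
    pin = f"{nccl_version}-{revision}+{cuda_tag}"
    # Runtime library and development headers must use the same pin
    packages = [f"{name}={pin}" for name in ("libnccl2", "libnccl-dev")]
    return "sudo apt install " + " ".join(packages)


if __name__ == "__main__":
    # Only prints the command; it does not run apt
    print(nccl_apt_command())
```

The helper only returns a string, so the command can be reviewed before it is run by hand.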