I am seeing the issue below when running the TLT 3.0 sample notebook for classification under WSL2 Ubuntu 20.04.
The same sample notebook runs fine on standalone Ubuntu 20.04.
Can someone take a look at this?
• Hardware (PC: AMD Ryzen 3900X, 64 GB RAM, RTX 3090, Windows 10 Insider Preview with WSL2, WSL2 kernel version 5.10.43.3-microsoft-standard-WSL2, GPU driver 471.11, 470.76)
• Network Type (Classification)
• TLT Version (v3.0-py3)
• Training spec file (I used the classification example Jupyter notebook downloaded by following the TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation (nvidia.com))
• How to reproduce the issue? Run the sample Jupyter notebook for classification under WSL2 Ubuntu 20.04 with the latest Docker and NVIDIA container libraries installed.
Running the notebook produces the following output.
(2021-07-08 07:47:05,713 [INFO] root: Registry: [‘nvcr.io’]
2021-07-08 07:47:05,756 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn’t exist locally/the manifest has changed. Pulling a new docker.
2021-07-08 07:47:05,756 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you’re doing this for the first time. Please wait here.
…
Repository name: nvcr.io/nvidia/tlt-streamanalytics
2021-07-08 08:18:09,152 [INFO] tlt.components.docker_handler.docker_handler: Container pull complete.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-my3ozdff because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
[‘model_config’, ‘train_config’]
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:281: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:290: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
2021-07-08 02:48:16,849 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
2021-07-08 02:48:16,849 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:302: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
Found 11667 images belonging to 20 classes.
2021-07-08 02:48:17,578 [INFO] main: Processing dataset (train): /workspace/tlt-experiments/data/split/train
Found 1670 images belonging to 20 classes.
2021-07-08 02:48:17,791 [INFO] main: Processing dataset (validation): /workspace/tlt-experiments/data/split/val
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
2021-07-08 02:48:17,791 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
2021-07-08 02:48:17,792 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
2021-07-08 02:48:17,805 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
2021-07-08 02:48:17,809 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
2021-07-08 02:48:18,341 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
2021-07-08 02:48:18,890 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
2021-07-08 02:48:18,890 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
2021-07-08 02:48:19,104 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 3, 224, 224) 0
conv1 (Conv2D) (None, 64, 112, 112) 9408 input_1[0][0]
bn_conv1 (BatchNormalization) (None, 64, 112, 112) 256 conv1[0][0]
activation_1 (Activation) (None, 64, 112, 112) 0 bn_conv1[0][0]
block_1a_conv_1 (Conv2D) (None, 64, 56, 56) 36864 activation_1[0][0]
block_1a_bn_1 (BatchNormalizati (None, 64, 56, 56) 256 block_1a_conv_1[0][0]
block_1a_relu_1 (Activation) (None, 64, 56, 56) 0 block_1a_bn_1[0][0]
block_1a_conv_2 (Conv2D) (None, 64, 56, 56) 36864 block_1a_relu_1[0][0]
block_1a_conv_shortcut (Conv2D) (None, 64, 56, 56) 4096 activation_1[0][0]
block_1a_bn_2 (BatchNormalizati (None, 64, 56, 56) 256 block_1a_conv_2[0][0]
block_1a_bn_shortcut (BatchNorm (None, 64, 56, 56) 256 block_1a_conv_shortcut[0][0]
add_1 (Add) (None, 64, 56, 56) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]
block_1a_relu (Activation) (None, 64, 56, 56) 0 add_1[0][0]
block_1b_conv_1 (Conv2D) (None, 64, 56, 56) 36864 block_1a_relu[0][0]
block_1b_bn_1 (BatchNormalizati (None, 64, 56, 56) 256 block_1b_conv_1[0][0]
block_1b_relu_1 (Activation) (None, 64, 56, 56) 0 block_1b_bn_1[0][0]
block_1b_conv_2 (Conv2D) (None, 64, 56, 56) 36864 block_1b_relu_1[0][0]
block_1b_conv_shortcut (Conv2D) (None, 64, 56, 56) 4096 block_1a_relu[0][0]
block_1b_bn_2 (BatchNormalizati (None, 64, 56, 56) 256 block_1b_conv_2[0][0]
block_1b_bn_shortcut (BatchNorm (None, 64, 56, 56) 256 block_1b_conv_shortcut[0][0]
add_2 (Add) (None, 64, 56, 56) 0 block_1b_bn_2[0][0]
block_1b_bn_shortcut[0][0]
block_1b_relu (Activation) (None, 64, 56, 56) 0 add_2[0][0]
block_2a_conv_1 (Conv2D) (None, 128, 28, 28) 73728 block_1b_relu[0][0]
block_2a_bn_1 (BatchNormalizati (None, 128, 28, 28) 512 block_2a_conv_1[0][0]
block_2a_relu_1 (Activation) (None, 128, 28, 28) 0 block_2a_bn_1[0][0]
block_2a_conv_2 (Conv2D) (None, 128, 28, 28) 147456 block_2a_relu_1[0][0]
block_2a_conv_shortcut (Conv2D) (None, 128, 28, 28) 8192 block_1b_relu[0][0]
block_2a_bn_2 (BatchNormalizati (None, 128, 28, 28) 512 block_2a_conv_2[0][0]
block_2a_bn_shortcut (BatchNorm (None, 128, 28, 28) 512 block_2a_conv_shortcut[0][0]
add_3 (Add) (None, 128, 28, 28) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]
block_2a_relu (Activation) (None, 128, 28, 28) 0 add_3[0][0]
block_2b_conv_1 (Conv2D) (None, 128, 28, 28) 147456 block_2a_relu[0][0]
block_2b_bn_1 (BatchNormalizati (None, 128, 28, 28) 512 block_2b_conv_1[0][0]
block_2b_relu_1 (Activation) (None, 128, 28, 28) 0 block_2b_bn_1[0][0]
block_2b_conv_2 (Conv2D) (None, 128, 28, 28) 147456 block_2b_relu_1[0][0]
block_2b_conv_shortcut (Conv2D) (None, 128, 28, 28) 16384 block_2a_relu[0][0]
block_2b_bn_2 (BatchNormalizati (None, 128, 28, 28) 512 block_2b_conv_2[0][0]
block_2b_bn_shortcut (BatchNorm (None, 128, 28, 28) 512 block_2b_conv_shortcut[0][0]
add_4 (Add) (None, 128, 28, 28) 0 block_2b_bn_2[0][0]
block_2b_bn_shortcut[0][0]
block_2b_relu (Activation) (None, 128, 28, 28) 0 add_4[0][0]
block_3a_conv_1 (Conv2D) (None, 256, 14, 14) 294912 block_2b_relu[0][0]
block_3a_bn_1 (BatchNormalizati (None, 256, 14, 14) 1024 block_3a_conv_1[0][0]
block_3a_relu_1 (Activation) (None, 256, 14, 14) 0 block_3a_bn_1[0][0]
block_3a_conv_2 (Conv2D) (None, 256, 14, 14) 589824 block_3a_relu_1[0][0]
block_3a_conv_shortcut (Conv2D) (None, 256, 14, 14) 32768 block_2b_relu[0][0]
block_3a_bn_2 (BatchNormalizati (None, 256, 14, 14) 1024 block_3a_conv_2[0][0]
block_3a_bn_shortcut (BatchNorm (None, 256, 14, 14) 1024 block_3a_conv_shortcut[0][0]
add_5 (Add) (None, 256, 14, 14) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]
block_3a_relu (Activation) (None, 256, 14, 14) 0 add_5[0][0]
block_3b_conv_1 (Conv2D) (None, 256, 14, 14) 589824 block_3a_relu[0][0]
block_3b_bn_1 (BatchNormalizati (None, 256, 14, 14) 1024 block_3b_conv_1[0][0]
block_3b_relu_1 (Activation) (None, 256, 14, 14) 0 block_3b_bn_1[0][0]
block_3b_conv_2 (Conv2D) (None, 256, 14, 14) 589824 block_3b_relu_1[0][0]
block_3b_conv_shortcut (Conv2D) (None, 256, 14, 14) 65536 block_3a_relu[0][0]
block_3b_bn_2 (BatchNormalizati (None, 256, 14, 14) 1024 block_3b_conv_2[0][0]
block_3b_bn_shortcut (BatchNorm (None, 256, 14, 14) 1024 block_3b_conv_shortcut[0][0]
add_6 (Add) (None, 256, 14, 14) 0 block_3b_bn_2[0][0]
block_3b_bn_shortcut[0][0]
block_3b_relu (Activation) (None, 256, 14, 14) 0 add_6[0][0]
block_4a_conv_1 (Conv2D) (None, 512, 14, 14) 1179648 block_3b_relu[0][0]
block_4a_bn_1 (BatchNormalizati (None, 512, 14, 14) 2048 block_4a_conv_1[0][0]
block_4a_relu_1 (Activation) (None, 512, 14, 14) 0 block_4a_bn_1[0][0]
block_4a_conv_2 (Conv2D) (None, 512, 14, 14) 2359296 block_4a_relu_1[0][0]
block_4a_conv_shortcut (Conv2D) (None, 512, 14, 14) 131072 block_3b_relu[0][0]
block_4a_bn_2 (BatchNormalizati (None, 512, 14, 14) 2048 block_4a_conv_2[0][0]
block_4a_bn_shortcut (BatchNorm (None, 512, 14, 14) 2048 block_4a_conv_shortcut[0][0]
add_7 (Add) (None, 512, 14, 14) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]
block_4a_relu (Activation) (None, 512, 14, 14) 0 add_7[0][0]
block_4b_conv_1 (Conv2D) (None, 512, 14, 14) 2359296 block_4a_relu[0][0]
block_4b_bn_1 (BatchNormalizati (None, 512, 14, 14) 2048 block_4b_conv_1[0][0]
block_4b_relu_1 (Activation) (None, 512, 14, 14) 0 block_4b_bn_1[0][0]
block_4b_conv_2 (Conv2D) (None, 512, 14, 14) 2359296 block_4b_relu_1[0][0]
block_4b_conv_shortcut (Conv2D) (None, 512, 14, 14) 262144 block_4a_relu[0][0]
block_4b_bn_2 (BatchNormalizati (None, 512, 14, 14) 2048 block_4b_conv_2[0][0]
block_4b_bn_shortcut (BatchNorm (None, 512, 14, 14) 2048 block_4b_conv_shortcut[0][0]
add_8 (Add) (None, 512, 14, 14) 0 block_4b_bn_2[0][0]
block_4b_bn_shortcut[0][0]
block_4b_relu (Activation) (None, 512, 14, 14) 0 add_8[0][0]
avg_pool (AveragePooling2D) (None, 512, 1, 1) 0 block_4b_relu[0][0]
flatten (Flatten) (None, 512) 0 avg_pool[0][0]
predictions (Dense) (None, 20) 10260 flatten[0][0]
Total params: 11,552,724
Trainable params: 11,376,020
Non-trainable params: 176,704
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
2021-07-08 02:48:29,028 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.
2021-07-08 02:48:29,036 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
2021-07-08 02:48:29,818 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
2021-07-08 02:48:29,912 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:929: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
2021-07-08 02:48:30,520 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:929: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:931: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
2021-07-08 02:48:30,520 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:931: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Epoch 1/80
1/183 […] - ETA: 18:54 - loss: 3.8637 - acc: 0.0312
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
2021-07-08 02:48:38,641 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
2/183 […] - ETA: 11:06 - loss: 3.7738 - acc: 0.0391
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.524770). Check your callbacks.
% delta_t_median)
183/183 [==============================] - 28s 154ms/step - loss: 2.1876 - acc: 0.4130 - val_loss: 1.6841 - val_acc: 0.5299
8bd5b9887ac4:129:181 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
8bd5b9887ac4:129:181 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
8bd5b9887ac4:129:181 [0] NCCL INFO NET/IB : No device found.
8bd5b9887ac4:129:181 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
8bd5b9887ac4:129:181 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
8bd5b9887ac4:129:181 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/…/…/0000:0a:00.0
8bd5b9887ac4:129:181 [0] NCCL INFO graph/xml.cc:469 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO graph/xml.cc:660 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO graph/topo.cc:523 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:581 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:840 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:876 → 2
8bd5b9887ac4:129:181 [0] NCCL INFO init.cc:887 → 2
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 492, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 494, in return_func
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 487, in main
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 460, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 77, in _average_metrics_in_place
self.backend.get_session().run(self.allreduce_ops[metric])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.
Original stack trace for ‘MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0’:
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 492, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 482, in return_func
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 487, in main
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py”, line 460, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/keras/callbacks.py”, line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 84, in on_epoch_end
self._average_metrics_in_place(logs)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 73, in _average_metrics_in_place
self._make_variable(metric, value)
File “/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py”, line 58, in _make_variable
allreduce_op = hvd.allreduce(var, device_dense=self.device)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py”, line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py”, line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 80, in horovod_allreduce
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()
2021-07-08 08:19:02,268 [INFO] tlt.components.docker_handler.docker_handler: Stopping container…)
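Since most of this log is deprecation noise, here is a minimal Python sketch that pulls out only the lines that seem to matter: the NCCL warnings, the root errors, and the graph node that failed. The function name `summarize_log` and the three regular expressions are my own assumptions, written against the lines visible in the output above; they are not part of TLT, Horovod, or NCCL.

```python
import re
from typing import Dict, List

# Hypothetical patterns, derived only from the lines visible in the log above.
NCCL_WARN = re.compile(r"NCCL WARN (.*)")
ROOT_ERROR = re.compile(r"\(\d+\) Unknown: (.*)")
FAILED_NODE = re.compile(r"\[\[(?:\{\{)?node ([\w/]+)")

PATTERNS = (
    ("nccl_warnings", NCCL_WARN),
    ("root_errors", ROOT_ERROR),
    ("failed_nodes", FAILED_NODE),
)


def summarize_log(text: str) -> Dict[str, List[str]]:
    """Collect distinct NCCL warnings, root errors, and failing node names."""
    summary: Dict[str, List[str]] = {key: [] for key, _ in PATTERNS}
    for line in text.splitlines():
        for key, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                value = match.group(1).strip()
                # Keep first occurrence only; the traceback repeats each error.
                if value not in summary[key]:
                    summary[key].append(value)
    return summary


# A few lines copied from the log above, used as sample input.
SAMPLE = "\n".join([
    "8bd5b9887ac4:129:181 [0] NCCL INFO NET/IB : No device found.",
    "8bd5b9887ac4:129:181 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/…/…/0000:0a:00.0",
    "(0) Unknown: ncclCommInitRank failed: unhandled system error",
    "[[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]",
    "(1) Unknown: ncclCommInitRank failed: unhandled system error",
])

for key, values in summarize_log(SAMPLE).items():
    print(key, values)
```

To run it against the full output, save the log to a text file and pass its contents to `summarize_log`. On this log it reduces the output to the single NCCL warning about the PCI bus path under /sys, the `ncclCommInitRank failed: unhandled system error` message, and the Horovod allreduce node from `MetricAverageCallback`.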