YOLOv3 training error: No execution plan worked!

Hi, I get the following error during YOLOv3 training:

File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py", line 77, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/models/yolov3_model.py", line 642, in train
File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: No execution plan worked!
     [[{{node conv1_1/convolution}}]]
     [[loss_1/add_48/_5949]]
  (1) Not found: No execution plan worked!
     [[{{node conv1_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
Using TensorFlow backend.

Have you run the default YOLOv3 Jupyter notebook to check whether it works?

Yes. Training sometimes succeeds, but sometimes it fails with this error.
Is this error similar to the one in the "SSD inference error" topic, and should I set TF_FORCE_GPU_ALLOW_GROWTH=true?
Or is it because the CUDA and cuDNN versions are not compatible?
Which versions of CUDA and cuDNN is the TAO container built with?
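
For reference, here is a minimal sketch of how TF_FORCE_GPU_ALLOW_GROWTH could be set from Python when launching training. Whether it resolves the "No execution plan worked!" error is not confirmed in this thread. The helper name `env_with_gpu_growth` is hypothetical, and the variable has to be present in the environment of the process that actually runs TensorFlow (for a container, inside the container).

```python
import os


def env_with_gpu_growth(base_env=None):
    """Return a copy of the environment with GPU memory growth requested.

    TF_FORCE_GPU_ALLOW_GROWTH=true asks TensorFlow to allocate GPU memory
    on demand instead of reserving most of it at startup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


# Hypothetical usage: pass the result as env= to subprocess.run(...) when
# starting the training command, so the variable is set before TensorFlow loads.
```

The function copies the environment instead of mutating `os.environ`, so the setting only applies to the process it is handed to.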

May I know which GPU you are running on? Please also share the output of:
$ nvidia-smi
$ dpkg -l |grep cuda
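
If it is easier, both outputs can be collected in one go. The following is only a sketch: the helper names `run_diagnostics` and `format_report` are made up for illustration, and it assumes the commands are run in the same environment (host or container) where training runs.

```python
import subprocess


def run_diagnostics(commands):
    """Run each shell command and return {command: combined stdout/stderr}."""
    results = {}
    for cmd in commands:
        proc = subprocess.run(
            cmd,
            shell=True,                 # needed for the pipe in "dpkg -l | grep cuda"
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep error text with the output
            universal_newlines=True,    # return str; also works on Python 3.6
        )
        results[cmd] = proc.stdout
    return results


def format_report(results):
    """Format the results the way they would be pasted into a forum post."""
    parts = []
    for cmd, output in results.items():
        parts.append("$ " + cmd)
        parts.append(output.rstrip())
    return "\n".join(parts)


# Example call (not executed here):
# print(format_report(run_diagnostics(["nvidia-smi", "dpkg -l | grep cuda"])))
```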

I use an NVIDIA GeForce RTX 3080.

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 37%   70C    P2    99W / 320W |   9213MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:21:00.0 Off |                  N/A |
| 30%   33C    P8    17W / 320W |      3MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:4B:00.0 Off |                  N/A |
| 30%   30C    P8    23W / 320W |      3MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:4C:00.0 Off |                  N/A |
| 32%   50C    P8    28W / 320W |      3MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
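
The table above shows GPU 0 at 9213MiB of 10240MiB while GPUs 1 to 3 are nearly idle. The thread does not say whether that memory belongs to the training job itself or to something else, so the following is only a sketch of how free memory per GPU could be checked before picking a device. It assumes input in the CSV form printed by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`; the function names are hypothetical and the sample values are copied from the table.

```python
def parse_gpu_memory(csv_text):
    """Parse lines of 'index, used_mib, total_mib' into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used, "free_mib": total - used})
    return gpus


def pick_least_loaded(gpus):
    """Return the index of the GPU with the most free memory (first one on ties)."""
    return max(gpus, key=lambda g: g["free_mib"])["index"]


# Values taken from the nvidia-smi table in this post.
sample = """0, 9213, 10240
1, 3, 10240
2, 3, 10240
3, 3, 10240"""

gpus = parse_gpu_memory(sample)
print(gpus[0]["free_mib"])       # free memory on GPU 0 in MiB
print(pick_least_loaded(gpus))   # index of the GPU with the most free memory
```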

$ dpkg -l |grep cuda

ii  cuda-cccl-11-6                  11.6.55-1                         amd64        CUDA CCCL
ii  cuda-compat-11-6                510.39.01-1                       amd64        CUDA Compatibility Platform
ii  cuda-cudart-11-6                11.6.55-1                         amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-6            11.6.55-1                         amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-6             11.6.55-1                         amd64        CUDA cuobjdump
ii  cuda-cupti-11-6                 11.6.55-1                         amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-6             11.6.55-1                         amd64        CUDA profiling tools interface.
ii  cuda-driver-dev-11-6            11.6.55-1                         amd64        CUDA Driver native dev stub library
ii  cuda-gdb-11-6                   11.6.55-1                         amd64        CUDA-GDB
ii  cuda-memcheck-11-6              11.6.55-1                         amd64        CUDA-MEMCHECK
ii  cuda-nvcc-11-6                  11.6.55-1                         amd64        CUDA nvcc
ii  cuda-nvdisasm-11-6              11.6.55-1                         amd64        CUDA disassembler
ii  cuda-nvml-dev-11-6              11.6.55-1                         amd64        NVML native dev links, headers
ii  cuda-nvprof-11-6                11.6.55-1                         amd64        CUDA Profiler tools
ii  cuda-nvprune-11-6               11.6.55-1                         amd64        CUDA nvprune
ii  cuda-nvrtc-11-4                 11.4.120-1                        amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-11-6                 11.6.55-1                         amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-4             11.4.120-1                        amd64        NVRTC native dev links, headers
ii  cuda-nvrtc-dev-11-6             11.6.55-1                         amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-6                  11.6.55-1                         amd64        NVIDIA Tools Extension
ii  cuda-sanitizer-11-6             11.6.55-1                         amd64        CUDA Sanitizer
ii  cuda-toolkit-11-4-config-common 11.4.108-1                        all          Common config package for CUDA Toolkit 11.4.
ii  cuda-toolkit-11-6-config-common 11.6.55-1                         all          Common config package for CUDA Toolkit 11.6.
ii  cuda-toolkit-11-config-common   11.6.55-1                         all          Common config package for CUDA Toolkit 11.
ii  cuda-toolkit-config-common      11.6.55-1                         all          Common config package for CUDA Toolkit.
ii  libcudnn8                       8.3.2.44-1+cuda11.5               amd64        cuDNN runtime libraries
ii  libcudnn8-dev                   8.3.2.44-1+cuda11.5               amd64        cuDNN development libraries and headers
ii  libnccl-dev                     2.11.4-1+cuda11.6                 amd64        NVIDIA Collective Communication Library(NCCL) Development Files
ii  libnccl2                        2.11.4-1+cuda11.6                 amd64        NVIDIA Collective Communication Library(NCCL) Runtime
ii  libnvinfer-bin                  8.2.5-1+cuda11.4                  amd64        TensorRT binaries
ii  libnvinfer-dev                  8.2.5-1+cuda11.4                  amd64        TensorRT development libraries and headers
ii  libnvinfer-plugin-dev           8.2.5-1+cuda11.4                  amd64        TensorRT plugin libraries
ii  libnvinfer-plugin8              8.2.5-1+cuda11.4                  amd64        TensorRT plugin libraries
ii  libnvinfer8                     8.2.5-1+cuda11.4                  amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev            8.2.5-1+cuda11.4                  amd64        TensorRT ONNX libraries
ii  libnvonnxparsers8               8.2.5-1+cuda11.4                  amd64        TensorRT ONNX libraries
ii  libnvparsers-dev                8.2.5-1+cuda11.4                  amd64        TensorRT parsers libraries
ii  libnvparsers8                   8.2.5-1+cuda11.4                  amd64        TensorRT parsers libraries
ii  python3-libnvinfer              8.2.5-1+cuda11.4                  amd64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev          8.2.5-1+cuda11.4                  amd64        Python 3 development package for TensorRT
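
To make the listing easier to read, the packages can be grouped by the CUDA version in their version suffix. This is a rough sketch with a hypothetical helper name, using three lines copied from the output above. Different suffixes (here 11.4, 11.5 and 11.6) do not by themselves prove an incompatibility; the grouping only makes the mix visible.

```python
import re
from collections import defaultdict

# Matches suffixes such as "+cuda11.5" in a Debian package version string.
CUDA_TAG = re.compile(r"\+cuda(\d+\.\d+)")


def group_by_cuda_tag(dpkg_text):
    """Group installed package names by the '+cudaX.Y' tag in their version."""
    groups = defaultdict(list)
    for line in dpkg_text.splitlines():
        fields = line.split()
        # dpkg -l rows for installed packages look like: ii <name> <version> <arch> <desc>
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name, version = fields[1], fields[2]
        match = CUDA_TAG.search(version)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


sample = """ii  libcudnn8    8.3.2.44-1+cuda11.5  amd64  cuDNN runtime libraries
ii  libnccl2     2.11.4-1+cuda11.6    amd64  NVIDIA Collective Communication Library(NCCL) Runtime
ii  libnvinfer8  8.2.5-1+cuda11.4     amd64  TensorRT runtime libraries"""

for tag, names in sorted(group_by_cuda_tag(sample).items()):
    print(tag, names)
```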

Did you train with 1 GPU or 4 GPUs?
Could you share the training command and the full log?

I train with 1 GPU.
Training command:

yolo_v3 train -e /workspace/results/yolov3_pv/unpruned_model/final_spec.txt -r /workspace/results/yolov3_pv/unpruned_model -k $KEY --gpus 1 --gpu_index 0
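
The command above pins training to `--gpu_index 0`, which is the GPU that shows 9213MiB in use in the nvidia-smi output. As an experiment, the same command could be assembled with a different index. This is a sketch only: `build_train_command` is a hypothetical helper, the flags are the ones used in the command above, and nothing in this thread confirms that switching GPUs fixes the error.

```python
def build_train_command(spec_path, results_dir, key, gpu_index, num_gpus=1):
    """Assemble the yolo_v3 train command as an argument list."""
    return [
        "yolo_v3", "train",
        "-e", spec_path,
        "-r", results_dir,
        "-k", key,
        "--gpus", str(num_gpus),
        "--gpu_index", str(gpu_index),
    ]


# Same paths as in the command above, but targeting GPU 1 instead of GPU 0.
cmd = build_train_command(
    "/workspace/results/yolov3_pv/unpruned_model/final_spec.txt",
    "/workspace/results/yolov3_pv/unpruned_model",
    "<your key>",   # placeholder for the value of $KEY
    gpu_index=1,
)
print(" ".join(cmd))
```

Building the command as a list keeps it ready for `subprocess.run(cmd)` without shell quoting issues.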

Full log:

Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
INFO: Log file already exists at /tmp/2440/unpruned_model/status.json
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
Input (InputLayer)              (None, 3, None, None 0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, None, Non 9408        Input[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, None, Non 256         conv1[0][0]
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, None, Non 0           bn_conv1[0][0]
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, None, Non 36864       activation_2[0][0]
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, None, Non 256         block_1a_conv_1[0][0]
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (None, 64, None, Non 0           block_1a_bn_1[0][0]
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, None, Non 36864       block_1a_relu_1[0][0]
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, None, Non 4096        activation_2[0][0]
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, None, Non 256         block_1a_conv_2[0][0]
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, None, Non 256         block_1a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_9 (Add)                     (None, 64, None, Non 0           block_1a_bn_2[0][0]
                                                                 block_1a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_1a_relu (Activation)      (None, 64, None, Non 0           add_9[0][0]
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, None, Non 36864       block_1a_relu[0][0]
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, None, Non 256         block_1b_conv_1[0][0]
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (None, 64, None, Non 0           block_1b_bn_1[0][0]
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, None, Non 36864       block_1b_relu_1[0][0]
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 64, None, Non 4096        block_1a_relu[0][0]
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, None, Non 256         block_1b_conv_2[0][0]
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (None, 64, None, Non 256         block_1b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_10 (Add)                    (None, 64, None, Non 0           block_1b_bn_2[0][0]
                                                                 block_1b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_1b_relu (Activation)      (None, 64, None, Non 0           add_10[0][0]
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, None, No 73728       block_1b_relu[0][0]
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, None, No 512         block_2a_conv_1[0][0]
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (None, 128, None, No 0           block_2a_bn_1[0][0]
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, None, No 147456      block_2a_relu_1[0][0]
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, None, No 8192        block_1b_relu[0][0]
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, None, No 512         block_2a_conv_2[0][0]
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, None, No 512         block_2a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_11 (Add)                    (None, 128, None, No 0           block_2a_bn_2[0][0]
                                                                 block_2a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2a_relu (Activation)      (None, 128, None, No 0           add_11[0][0]
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, None, No 147456      block_2a_relu[0][0]
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, None, No 512         block_2b_conv_1[0][0]
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (None, 128, None, No 0           block_2b_bn_1[0][0]
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, None, No 147456      block_2b_relu_1[0][0]
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 128, None, No 16384       block_2a_relu[0][0]
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, None, No 512         block_2b_conv_2[0][0]
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (None, 128, None, No 512         block_2b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_12 (Add)                    (None, 128, None, No 0           block_2b_bn_2[0][0]
                                                                 block_2b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2b_relu (Activation)      (None, 128, None, No 0           add_12[0][0]
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, None, No 294912      block_2b_relu[0][0]
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, None, No 1024        block_3a_conv_1[0][0]
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (None, 256, None, No 0           block_3a_bn_1[0][0]
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, None, No 589824      block_3a_relu_1[0][0]
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, None, No 32768       block_2b_relu[0][0]
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, None, No 1024        block_3a_conv_2[0][0]
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, None, No 1024        block_3a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_13 (Add)                    (None, 256, None, No 0           block_3a_bn_2[0][0]
                                                                 block_3a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 256, None, No 0           add_13[0][0]
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, None, No 589824      block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, None, No 1024        block_3b_conv_1[0][0]
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, None, No 0           block_3b_bn_1[0][0]
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, None, No 589824      block_3b_relu_1[0][0]
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 256, None, No 65536       block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, None, No 1024        block_3b_conv_2[0][0]
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (None, 256, None, No 1024        block_3b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_14 (Add)                    (None, 256, None, No 0           block_3b_bn_2[0][0]
                                                                 block_3b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 256, None, No 0           add_14[0][0]
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, None, No 1179648     block_3b_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, None, No 2048        block_4a_conv_1[0][0]
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, None, No 0           block_4a_bn_1[0][0]
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, None, No 2359296     block_4a_relu_1[0][0]
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, None, No 131072      block_3b_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, None, No 2048        block_4a_conv_2[0][0]
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, None, No 2048        block_4a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_15 (Add)                    (None, 512, None, No 0           block_4a_bn_2[0][0]
                                                                 block_4a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 512, None, No 0           add_15[0][0]
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, None, No 2359296     block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, None, No 2048        block_4b_conv_1[0][0]
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, None, No 0           block_4b_bn_1[0][0]
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, None, No 2359296     block_4b_relu_1[0][0]
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 512, None, No 262144      block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, None, No 2048        block_4b_conv_2[0][0]
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (None, 512, None, No 2048        block_4b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_16 (Add)                    (None, 512, None, No 0           block_4b_bn_2[0][0]
                                                                 block_4b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 512, None, No 0           add_16[0][0]
__________________________________________________________________________________________________
yolo_expand_conv1 (Conv2D)      (None, 512, None, No 2359296     block_4b_relu[0][0]
__________________________________________________________________________________________________
yolo_expand_conv1_bn (BatchNorm (None, 512, None, No 2048        yolo_expand_conv1[0][0]
__________________________________________________________________________________________________
yolo_expand_conv1_lrelu (LeakyR (None, 512, None, No 0           yolo_expand_conv1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_1 (Conv2D)           (None, 256, None, No 131072      yolo_expand_conv1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv1_1[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv1_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_2 (Conv2D)           (None, 512, None, No 1179648     yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_2_bn (BatchNormaliza (None, 512, None, No 2048        yolo_conv1_2[0][0]
__________________________________________________________________________________________________
yolo_conv1_2_lrelu (LeakyReLU)  (None, 512, None, No 0           yolo_conv1_2_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_3 (Conv2D)           (None, 256, None, No 131072      yolo_conv1_2_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_3_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv1_3[0][0]
__________________________________________________________________________________________________
yolo_conv1_3_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv1_3_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_4 (Conv2D)           (None, 512, None, No 1179648     yolo_conv1_3_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_4_bn (BatchNormaliza (None, 512, None, No 2048        yolo_conv1_4[0][0]
__________________________________________________________________________________________________
yolo_conv1_4_lrelu (LeakyReLU)  (None, 512, None, No 0           yolo_conv1_4_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_5 (Conv2D)           (None, 256, None, No 131072      yolo_conv1_4_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_5_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv1_5[0][0]
__________________________________________________________________________________________________
yolo_conv1_5_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv1_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv2 (Conv2D)             (None, 128, None, No 32768       yolo_conv1_5_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv2_bn (BatchNormalizati (None, 128, None, No 512         yolo_conv2[0][0]
__________________________________________________________________________________________________
yolo_conv2_lrelu (LeakyReLU)    (None, 128, None, No 0           yolo_conv2_bn[0][0]
__________________________________________________________________________________________________
upsample0 (UpSampling2D)        (None, 128, None, No 0           yolo_conv2_lrelu[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 384, None, No 0           upsample0[0][0]
block_3b_relu[0][0]
__________________________________________________________________________________________________
yolo_conv3_1 (Conv2D)           (None, 128, None, No 49152       concatenate_3[0][0]
__________________________________________________________________________________________________
yolo_conv3_1_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv3_1[0][0]
__________________________________________________________________________________________________
yolo_conv3_1_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv3_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_2 (Conv2D)           (None, 256, None, No 294912      yolo_conv3_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_2_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv3_2[0][0]
__________________________________________________________________________________________________
yolo_conv3_2_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv3_2_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_3 (Conv2D)           (None, 128, None, No 32768       yolo_conv3_2_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_3_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv3_3[0][0]
__________________________________________________________________________________________________
yolo_conv3_3_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv3_3_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_4 (Conv2D)           (None, 256, None, No 294912      yolo_conv3_3_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_4_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv3_4[0][0]
__________________________________________________________________________________________________
yolo_conv3_4_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv3_4_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_5 (Conv2D)           (None, 128, None, No 32768       yolo_conv3_4_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_5_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv3_5[0][0]
__________________________________________________________________________________________________
yolo_conv3_5_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv3_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv4 (Conv2D)             (None, 64, None, Non 8192        yolo_conv3_5_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv4_bn (BatchNormalizati (None, 64, None, Non 256         yolo_conv4[0][0]
__________________________________________________________________________________________________
yolo_conv4_lrelu (LeakyReLU)    (None, 64, None, Non 0           yolo_conv4_bn[0][0]
__________________________________________________________________________________________________
upsample1 (UpSampling2D)        (None, 64, None, Non 0           yolo_conv4_lrelu[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 192, None, No 0           upsample1[0][0]
block_2b_relu[0][0]
__________________________________________________________________________________________________
yolo_conv5_1 (Conv2D)           (None, 64, None, Non 12288       concatenate_4[0][0]
__________________________________________________________________________________________________
yolo_conv5_1_bn (BatchNormaliza (None, 64, None, Non 256         yolo_conv5_1[0][0]
__________________________________________________________________________________________________
yolo_conv5_1_lrelu (LeakyReLU)  (None, 64, None, Non 0           yolo_conv5_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv5_2 (Conv2D)           (None, 128, None, No 73728       yolo_conv5_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv5_2_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv5_2[0][0]
__________________________________________________________________________________________________
yolo_conv5_2_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv5_2_bn[0][0]
__________________________________________________________________________________________________
yolo_conv5_3 (Conv2D)           (None, 64, None, Non 8192        yolo_conv5_2_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv5_3_bn (BatchNormaliza (None, 64, None, Non 256         yolo_conv5_3[0][0]
__________________________________________________________________________________________________
yolo_conv5_3_lrelu (LeakyReLU)  (None, 64, None, Non 0           yolo_conv5_3_bn[0][0]
__________________________________________________________________________________________________
yolo_conv5_4 (Conv2D)           (None, 128, None, No 73728       yolo_conv5_3_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv5_4_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv5_4[0][0]
__________________________________________________________________________________________________
yolo_conv5_4_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv5_4_bn[0][0]
__________________________________________________________________________________________________
yolo_conv5_5 (Conv2D)           (None, 64, None, Non 8192        yolo_conv5_4_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv5_5_bn (BatchNormaliza (None, 64, None, Non 256         yolo_conv5_5[0][0]
__________________________________________________________________________________________________
yolo_conv5_5_lrelu (LeakyReLU)  (None, 64, None, Non 0           yolo_conv5_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_6 (Conv2D)           (None, 512, None, No 1179648     yolo_conv1_5_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_6 (Conv2D)           (None, 256, None, No 294912      yolo_conv3_5_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv5_6 (Conv2D)           (None, 128, None, No 73728       yolo_conv5_5_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_bn (BatchNormaliza (None, 512, None, No 2048        yolo_conv1_6[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv3_6[0][0]
__________________________________________________________________________________________________
yolo_conv5_6_bn (BatchNormaliza (None, 128, None, No 512         yolo_conv5_6[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_lrelu (LeakyReLU)  (None, 512, None, No 0           yolo_conv1_6_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv3_6_bn[0][0]
__________________________________________________________________________________________________
yolo_conv5_6_lrelu (LeakyReLU)  (None, 128, None, No 0           yolo_conv5_6_bn[0][0]
__________________________________________________________________________________________________
conv_big_object (Conv2D)        (None, 24, None, Non 12312       yolo_conv1_6_lrelu[0][0]
__________________________________________________________________________________________________
conv_mid_object (Conv2D)        (None, 24, None, Non 6168        yolo_conv3_6_lrelu[0][0]
__________________________________________________________________________________________________
conv_sm_object (Conv2D)         (None, 24, None, Non 3096        yolo_conv5_6_lrelu[0][0]
__________________________________________________________________________________________________
bg_permute (Permute)            (None, None, None, 2 0           conv_big_object[0][0]
__________________________________________________________________________________________________
md_permute (Permute)            (None, None, None, 2 0           conv_mid_object[0][0]
__________________________________________________________________________________________________
sm_permute (Permute)            (None, None, None, 2 0           conv_sm_object[0][0]
__________________________________________________________________________________________________
bg_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_big_object[0][0]
__________________________________________________________________________________________________
bg_reshape (Reshape)            (None, None, 8)      0           bg_permute[0][0]
__________________________________________________________________________________________________
md_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_mid_object[0][0]
__________________________________________________________________________________________________
md_reshape (Reshape)            (None, None, 8)      0           md_permute[0][0]
__________________________________________________________________________________________________
sm_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_sm_object[0][0]
__________________________________________________________________________________________________
sm_reshape (Reshape)            (None, None, 8)      0           sm_permute[0][0]
__________________________________________________________________________________________________
encoded_bg (Concatenate)        (None, None, 14)     0           bg_anchor[0][0]
bg_reshape[0][0]
__________________________________________________________________________________________________
encoded_md (Concatenate)        (None, None, 14)     0           md_anchor[0][0]
md_reshape[0][0]
__________________________________________________________________________________________________
encoded_sm (Concatenate)        (None, None, 14)     0           sm_anchor[0][0]
sm_reshape[0][0]
__________________________________________________________________________________________________
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

INFO: Starting Training Loop.

encoded_detections (Concatenate (None, None, 14)     0           encoded_bg[0][0]
encoded_md[0][0]
encoded_sm[0][0]
==================================================================================================
Total params: 19,164,680
Trainable params: 19,143,560
Non-trainable params: 21,120
__________________________________________________________________________________________________
Epoch 1/100
INFO: 2 root error(s) found.
(0) Not found: No execution plan worked!
[[{{node conv1_1/convolution}}]]
[[loss_1/add_48/_5949]]
(1) Not found: No execution plan worked!
[[{{node conv1_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py", line 145, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py", line 141, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py", line 126, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/train.py", line 77, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/models/yolov3_model.py", line 642, in train
File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: No execution plan worked!
[[{{node conv1_1/convolution}}]]
[[loss_1/add_48/_5949]]
(1) Not found: No execution plan worked!
[[{{node conv1_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
Using TensorFlow backend.

How did you launch the docker container? With the command below, right?

$ tao yolo_v3

I run the command inside the container:
$ yolo_v3 train

I understand, but how did you launch the docker container?
Via "$ docker run xxx" or "$ tao yolo_v3 run /bin/bash"?

Via docker run.

Could you please share the full command?

docker run -d --gpus all --ipc=host --network host --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Could you add the option below to the command and retry?
--runtime=nvidia
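
For reference, here is a minimal sketch of what the retried command might look like: the docker run command shared earlier in this thread with --runtime=nvidia added. It assumes the NVIDIA container runtime is installed and registered with Docker on the host. The build_cmd function and the RUN_CONTAINER variable are hypothetical names used only in this sketch, which prints the command by default instead of starting a container.

```shell
#!/bin/sh
# Image tag copied from the docker run command shared in this thread.
IMAGE="nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3"

# Print the retry command: the original flags plus --runtime=nvidia.
build_cmd() {
    printf '%s\n' "docker run -d --runtime=nvidia --gpus all --ipc=host --network host --rm $IMAGE"
}

# Dry run by default: only show the command that would be executed.
build_cmd

# Set RUN_CONTAINER=1 (hypothetical switch for this sketch) to actually
# start the container. Requires Docker and the NVIDIA runtime on the host.
if [ "${RUN_CONTAINER:-0}" = "1" ]; then
    eval "$(build_cmd)"
fi
```

If the container starts with this option, rerunning "yolo_v3 train" inside it would show whether the "No execution plan worked!" error still appears.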

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.