Environment variables and Docker mount error for transfer learning using YOLOv4

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
T4
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Yolo_v4
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
docker_tag: v3.21.08-py3
• Training spec file(If have, please share here)
Problem: I would like to use transfer learning to retrain yolo_v4 on my custom dataset, and I am receiving an error related to the directory mapping to Docker.
The training images and labels are in the following directory:
~/cv_samples_v1.2.0/data/training
The yolo_v4 IPython notebook is in:
~/cv_samples_v1.2.0/yolo_v4
The training images and labels share the same IDs, but the IDs are not consecutive as they are in the KITTI dataset (e.g. image 0, image 3, image 5 with labels image 0.txt, image 3.txt).
Is the input image size important? All the images are the same size.
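As far as I understand the KITTI-style layout, an image and its label are paired by file name rather than by position, so gaps in the numbering should not matter as long as every image has a label with the same base name. A minimal sketch for checking that on the host follows. The function name `find_unpaired` and the list of image extensions are assumptions for illustration, not part of TAO.

```python
from pathlib import Path


def find_unpaired(image_dir, label_dir, image_exts=(".jpg", ".jpeg", ".png")):
    """Return (images_without_label, labels_without_image) as sorted lists of base names."""
    # Base names of all image files, e.g. "3" for "3.png".
    image_stems = {
        p.stem for p in Path(image_dir).iterdir()
        if p.suffix.lower() in image_exts
    }
    # Base names of all label files, e.g. "3" for "3.txt".
    label_stems = {
        p.stem for p in Path(label_dir).iterdir()
        if p.suffix.lower() == ".txt"
    }
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling it on the training images and labels folders should return two empty lists when every image has a matching label.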

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(65.58, 144.56),(167.02, 91.94),(281.86, 179.56)]"
  mid_anchor_shape: "[(49.35, 40.93),(74.40, 28.85),(96.41, 56.22) ]"
  small_anchor_shape: "[(16.79, 15.72),(33.54, 23.51),(23.51, 59.22)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 80
  enable_qat: true
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "~/cv_samples_v1.2.0/experiment/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    label_directory_path: "/home/ubuntu/cv_samples_v1.2.0/training/data/labels"
    image_directory_path: "/home/ubuntu/cv_samples_v1.2.0/training/data/images"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  target_class_mapping {
    key: "Car"
    value: "car"
  }
  target_class_mapping {
    key: "person_sitting"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "HVAC Unit"
    value: "HVAC Unit"
  }
  target_class_mapping {
    key: "Person"
    value: "Person"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti_seq.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2021-09-29 22:10:10,448 [INFO] root: Registry: ['nvcr.io']
2021-09-29 22:10:10,524 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-09-29 22:10:16,690 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-09-29 22:10:16,691 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:40: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-09-29 22:10:16,744 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:40: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:43: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-09-29 22:10:16,745 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:43: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-09-29 22:10:17,197 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-09-29 22:10:17,199 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-09-29 22:10:17,217 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

2021-09-29 22:10:17,806 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

2021-09-29 22:10:18,031 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-09-29 22:10:18,394 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-09-29 22:10:18,394 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-09-29 22:10:18,786 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 110, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 106, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 58, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/utils.py", line 57, in build_training_pipeline
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/dataio/data_sequence.py", line 18, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/dataio/detection_data_sequence.py", line 52, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/dataio/detection_data_sequence.py", line 73, in _add_source
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/cv_samples_v1.2.0/training/data/images'
2021-09-29 22:11:21,462 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

#------------ ----------- definition of environment variables and mapping to Docker

# Setting up env variables for cleaner command line commands.
import os

print("Please replace the variable with your key.")
%env KEY=nvidia_tlt
# %env USER_EXPERIMENT_DIR=/workspace/tao-experiments/yolo_v4
# %env DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data
%env USER_EXPERIMENT_DIR=/home/ubuntu/cv_samples_v1.2.0/experiment
%env DATA_DOWNLOAD_DIR=/home/ubuntu/cv_samples_v1.2.0/data

# Set this path if you don't run the notebook from the samples directory.
#%env NOTEBOOK_ROOT=~/cv_samples_v1.2.0/yolo_v4

# Please define this local project directory that needs to be mapped to the TAO docker session.
# The dataset expected to be present in $LOCAL_PROJECT_DIR/data, while the results for the steps
# in this notebook will be stored at $LOCAL_PROJECT_DIR/yolo_v4
# %env LOCAL_PROJECT_DIR=YOUR_LOCAL_PROJECT_DIR_PATH
%env LOCAL_PROJECT_DIR=/home/ubuntu/cv_samples_v1.2.0
os.environ["LOCAL_DATA_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "data")
os.environ["LOCAL_EXPERIMENT_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "yolo_v4")

# The sample spec files are present in the same path as the downloaded samples.
os.environ["LOCAL_SPECS_DIR"] = os.path.join(
    os.getenv("NOTEBOOK_ROOT", os.getcwd()),
    "specs"
)
#%env SPECS_DIR=/workspace/tao-experiments/yolo_v4/specs
%env SPECS_DIR=/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs

# Showing list of specification files.
!ls -rlt $LOCAL_SPECS_DIR
# Mapping up the local directories to the TAO docker.
import json
mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
        # Mapping the data directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": os.environ["LOCAL_PROJECT_DIR"]
        },
        # Mapping the specs directory.
        {
            "source": os.environ["LOCAL_SPECS_DIR"],
            "destination": os.environ["LOCAL_SPECS_DIR"]
        },
    ]
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)

#-------- the output --------------------------
{
    "Mounts": [
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0",
            "destination": "/home/ubuntu/cv_samples_v1.2.0"
        },
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs",
            "destination": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs"
        }
    ]
}

When you run the command above, the path to $SPECS_DIR/yolo_v4_train_resnet18_kitti_seq.txt should be a path inside the Docker container.
So, please check your ~/.tao_mounts.json.

This JSON file maps each local source directory to a destination path inside the container.
So, every path in your training spec file should be a path inside the container.

For more information, see the TAO Toolkit Launcher page in the TAO Toolkit 3.0 documentation.
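To illustrate the mapping, here is a minimal sketch that translates a host path into the path it would have inside the container, given the "Mounts" list from ~/.tao_mounts.json. The helper name `to_container_path` is hypothetical and not part of the TAO launcher. When several sources contain the path, the sketch simply picks the most specific one.

```python
import os


def to_container_path(host_path, mounts):
    """Map a host path to its path inside the container, or None if it is not mounted."""
    host_path = os.path.abspath(os.path.expanduser(host_path))
    best = None
    for mount in mounts:
        source = os.path.abspath(os.path.expanduser(mount["source"]))
        # The path is covered if it is the source itself or lies below it.
        if host_path == source or host_path.startswith(source + os.sep):
            # Keep the longest (most specific) matching source.
            if best is None or len(source) > len(best[0]):
                best = (source, mount["destination"])
    if best is None:
        return None
    source, destination = best
    relative = os.path.relpath(host_path, source)
    return destination if relative == "." else os.path.join(destination, relative)
```

For example, with a single mount from /home/ubuntu/cv_samples_v1.2.0 to /workspace/tao-experiments, the host folder /home/ubuntu/cv_samples_v1.2.0/data/training/images maps to /workspace/tao-experiments/data/training/images, and that second form is what the spec file would need. When source and destination are identical, as in the mounts file above, the host path and the container path are the same.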

Hi,
As the following code shows, in my previous experiment I set the source and destination to the same directory.
#------------------------------------------------------------------------

# Mapping up the local directories to the TAO docker.
import json
mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
        # Mapping the data directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": os.environ["LOCAL_PROJECT_DIR"]
        },
        # Mapping the specs directory.
        {
            "source": os.environ["LOCAL_SPECS_DIR"],
            "destination": os.environ["LOCAL_SPECS_DIR"]
        },
    ]
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)

and the output is:

{
    "Mounts": [
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0",
            "destination": "/home/ubuntu/cv_samples_v1.2.0"
        },
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs",
            "destination": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs"
        }
    ]
}

In the specification file I set the paths accordingly in the dataset configuration:

data_sources: {
      label_directory_path: "/home/ubuntu/cv_samples_v1.2.0/training/data/labels"
      image_directory_path: "/home/ubuntu/cv_samples_v1.2.0/training/data/images"
  }

#------------------------------------------------------------------------
What should I have done differently? In addition, can you please clarify the following parameters from yolo_v4.ipynb:
USER_EXPERIMENT_DIR, DATA_DOWNLOAD_DIR, and LOCAL_PROJECT_DIR, as they apply to cv_samples_v1.2.0 for computer vision applications? Also, what is the difference between

%env KEY=nvidia_tlt
%env KEY=nvidia_tao

Regards

If you set the mounts as above, can you "ls" the destination inside the container to find the expected files?
! tao yolo_v4 run ls /home/ubuntu/cv_samples_v1.2.0

Yes, I was able to ls the following:

!tao yolo_v4 run ls /home/ubuntu/cv_samples_v1.2.0/data
KanataSmaltestV1.mp4	 data_object_label_2.zip  training
data_object_image_2.zip  testing		  val
2021-09-30 03:13:00,248 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

and

!tao yolo_v4 run ls /home/ubuntu/cv_samples_v1.2.0
2021-09-30 03:14:10,150 [INFO] root: Registry: ['nvcr.io']
2021-09-30 03:14:10,228 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
augment		 deps	       faster_rcnn   lprnet		       ssd
aws_s3_mounting  detectnet_v2  fpenet	     mask_rcnn		       unet
bpnet		 dssd	       gazenet	     multitask_classification  yolo_v3
classification	 emotionnet    gesturenet    ngccli		       yolo_v4
data		 facenet       heartratenet  retinanet
2021-09-30 03:14:10,703 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I also had a mistake in the spec file. I corrected the paths for the training images:

data_sources: {
      label_directory_path: "/home/ubuntu/cv_samples_v1.2.0/data/training/labels"
      image_directory_path: "/home/ubuntu/cv_samples_v1.2.0/data/training/images"
  }

Then I ran the training again, and I am getting the following error:

Invalid decryption. Unable to open file (unable to open file: name = '/home/ubuntu/cv_samples_v1.2.0/experiment/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0). The key used to load the model is incorrect.

Please run the command below to check whether the hdf5 file is available.
! tao yolo_v4 run ls /home/ubuntu/cv_samples_v1.2.0/experiment/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5
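Since several of the errors in this thread come from paths in the spec file that do not exist, it may help to check all of them at once before training. Below is a minimal sketch that pulls every field ending in "path" out of the spec text and reports the ones that are missing. The name `missing_spec_paths` and the regular expression are assumptions for illustration. The check runs on the machine where you call it, so it only reflects the container when source and destination are mapped to the same path (as in the mounts file in this thread) or when it is run inside the container.

```python
import os
import re

# Matches lines such as:  image_directory_path: "/some/dir"
# Lines commented out with '#' are skipped because the field must start the line.
PATH_FIELD = re.compile(r'^[ \t]*(\w*path)[ \t]*:[ \t]*"([^"]+)"', re.MULTILINE)


def missing_spec_paths(spec_text):
    """Return (field, path) pairs from the spec text whose path does not exist."""
    return [
        (field, path)
        for field, path in PATH_FIELD.findall(spec_text)
        if not os.path.exists(os.path.expanduser(path))
    ]
```

Reading the spec with `open(spec_file).read()` and passing the text to this function returns an empty list when every label, image, and pretrained-model path is reachable.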

OK, again it looks like the path was not set properly.
Now, how can I skip validation? I do not have any validation data, and I am facing the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/workspace/tao-experiments/data/val/image'

Also, it looks like there is no weights directory:
Traceback (most recent call last):

  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 110, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 106, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 58, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/utils.py", line 111, in build_training_pipeline
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py", line 588, in build_savers
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/cv_samples_v1.2.0/experiment/experiment_dir_unpruned/weights'

You can set aside part of the training data as validation data.
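One way to do that is sketched below: it moves a random fraction of the image/label pairs into separate validation folders. The function name `split_validation`, the 10% default, and the fixed seed are assumptions for illustration. The destination folders are whatever your validation_data_sources entries point to.

```python
import random
import shutil
from pathlib import Path


def split_validation(image_dir, label_dir, val_image_dir, val_label_dir,
                     val_fraction=0.1, seed=42):
    """Move a random fraction of image/label pairs into the validation folders."""
    image_dir, label_dir = Path(image_dir), Path(label_dir)
    val_image_dir, val_label_dir = Path(val_image_dir), Path(val_label_dir)
    val_image_dir.mkdir(parents=True, exist_ok=True)
    val_label_dir.mkdir(parents=True, exist_ok=True)

    # Only consider images that have a label file with the same base name.
    pairs = []
    for img in sorted(image_dir.iterdir()):
        lbl = label_dir / (img.stem + ".txt")
        if img.is_file() and lbl.is_file():
            pairs.append((img, lbl))

    # Shuffle reproducibly, then take the first n_val pairs.
    random.Random(seed).shuffle(pairs)
    n_val = max(1, int(len(pairs) * val_fraction)) if pairs else 0

    moved = []
    for img, lbl in pairs[:n_val]:
        shutil.move(str(img), str(val_image_dir / img.name))
        shutil.move(str(lbl), str(val_label_dir / lbl.name))
        moved.append(img.stem)
    return sorted(moved)
```

The files are moved rather than copied so that the validation images are no longer part of the training set. Work on a copy of the dataset if you want to keep the original layout.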

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/cv_samples_v1.2.0/experiment/experiment_dir_unpruned/weights'

You need to create the results directory first:
! mkdir experiment_dir_unpruned
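The same step can be done from the notebook in Python, which avoids depending on the current working directory. This is a minimal sketch. The helper name `ensure_results_dir` is hypothetical, and it assumes the experiment directory is a host path that is mounted at the same path inside the container, as in the mounts file shown earlier in this thread.

```python
import os


def ensure_results_dir(experiment_dir, name="experiment_dir_unpruned"):
    """Create the results directory (and any missing parents) if it does not exist yet."""
    results_dir = os.path.join(experiment_dir, name)
    # exist_ok=True makes the call safe to repeat when re-running the notebook cell.
    os.makedirs(results_dir, exist_ok=True)
    return results_dir
```

In the notebook this would be called as `ensure_results_dir(os.environ["USER_EXPERIMENT_DIR"])` before the training cell.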