Retrain TrafficCamNet using TLT

Hi,

I want to retrain TrafficCamNet on my custom data, which contains the classes car, bus, truck, and auto-rickshaw. I have resized the images to 940x544 and the labeling is also complete.

I want to know what changes I have to make, or which steps I can skip for my own dataset, in the detectnet_v2 Jupyter notebook to retrain TrafficCamNet.

Please provide your guidance as I am very new to this.

Regards,
Vikas

What do you mean by “skip the particular step”?

Hi,

Sorry for the late reply.

What I mean by “skip the particular step” is shown in the example below:

For example, for my own custom data I skipped the steps below in the Jupyter notebook, since I already have my own data and don't need to download or unzip anything:

A. Download the dataset

Once you have gotten the download links in your email, please populate them in place of the KITTI_IMAGES_DOWNLOAD_URL and the KITTI_LABELS_DOWNLOAD_URL. The next cell will download the data and place it in $LOCAL_DATA_DIR.

import os
!mkdir -p $LOCAL_DATA_DIR
os.environ["URL_IMAGES"]=KITTI_IMAGES_DOWNLOAD_URL
!if [ ! -f $LOCAL_DATA_DIR/data_object_image_2.zip ]; then wget $URL_IMAGES -O $LOCAL_DATA_DIR/data_object_image_2.zip; else echo "image archive already downloaded"; fi
os.environ["URL_LABELS"]=KITTI_LABELS_DOWNLOAD_URL
!if [ ! -f $LOCAL_DATA_DIR/data_object_label_2.zip ]; then wget $URL_LABELS -O $LOCAL_DATA_DIR/data_object_label_2.zip; else echo "label archive already downloaded"; fi

B. Verify downloaded dataset

# Check the dataset is present
!if [ ! -f $LOCAL_DATA_DIR/data_object_image_2.zip ]; then echo 'Image zip file not found, please download.'; else echo 'Found Image zip file.';fi
!if [ ! -f $LOCAL_DATA_DIR/data_object_label_2.zip ]; then echo 'Label zip file not found, please download.'; else echo 'Found Labels zip file.';fi
# This may take a while: verify integrity of zip files
!sha256sum $LOCAL_DATA_DIR/data_object_image_2.zip | cut -d ' ' -f 1 | grep -xq '^351c5a2aa0cd9238b50174a3a62b846bc5855da256b82a196431d60ff8d43617$' ; \
if test $? -eq 0; then echo "images OK"; else echo "images corrupt, redownload!" && rm -f $LOCAL_DATA_DIR/data_object_image_2.zip; fi
!sha256sum $LOCAL_DATA_DIR/data_object_label_2.zip | cut -d ' ' -f 1 | grep -xq '^4efc76220d867e1c31bb980bbf8cbc02599f02a9cb4350effa98dbb04aaed880$' ; \
if test $? -eq 0; then echo "labels OK"; else echo "labels corrupt, redownload!" && rm -f $LOCAL_DATA_DIR/data_object_label_2.zip; fi
# unpack downloaded datasets to $DATA_DOWNLOAD_DIR.
# The training images will be under $DATA_DOWNLOAD_DIR/training/image_2 and
# labels will be under $DATA_DOWNLOAD_DIR/training/label_2.
# The testing images will be under $DATA_DOWNLOAD_DIR/testing/image_2.
!unzip -u $LOCAL_DATA_DIR/data_object_image_2.zip -d $LOCAL_DATA_DIR
!unzip -u $LOCAL_DATA_DIR/data_object_label_2.zip -d $LOCAL_DATA_DIR
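Skipping the download cells boils down to putting your own files into the same directory layout the unzip step would have produced. A minimal sketch (the paths are assumptions based on the notebook's defaults; adjust LOCAL_DATA_DIR to your setup):

```shell
# Sketch only: place your own data where the notebook expects the
# unpacked KITTI layout. LOCAL_DATA_DIR defaults to /tmp/tao_data here;
# in the notebook it is already set by an earlier cell.
LOCAL_DATA_DIR="${LOCAL_DATA_DIR:-/tmp/tao_data}"
mkdir -p "$LOCAL_DATA_DIR/training/image_2" \
         "$LOCAL_DATA_DIR/training/label_2" \
         "$LOCAL_DATA_DIR/testing/image_2"
# Then copy your resized images and KITTI-format label files in, e.g.:
#   cp /path/to/my_images/*.png "$LOCAL_DATA_DIR/training/image_2/"
#   cp /path/to/my_labels/*.txt "$LOCAL_DATA_DIR/training/label_2/"
echo "Layout ready under $LOCAL_DATA_DIR"
```

With the data in this layout, the later tfrecords-conversion cell can run unchanged.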

Now I have made some changes in the detectnet_v2 Jupyter notebook; please let me know if I am doing anything wrong.

1. I have commented out the lines below, as I did not want to download the datasets.

A. Download the dataset

import os
!mkdir -p $LOCAL_DATA_DIR
#os.environ["URL_IMAGES"]=KITTI_IMAGES_DOWNLOAD_URL
#!if [ ! -f $LOCAL_DATA_DIR/data_object_image_2.zip ]; then wget $URL_IMAGES -O $LOCAL_DATA_DIR/data_object_image_2.zip; else echo "image archive already downloaded"; fi
#os.environ["URL_LABELS"]=KITTI_LABELS_DOWNLOAD_URL
#!if [ ! -f $LOCAL_DATA_DIR/data_object_label_2.zip ]; then wget $URL_LABELS -O $LOCAL_DATA_DIR/data_object_label_2.zip; else echo "label archive already downloaded"; fi

2. Change in downloading the pre-trained model

D. Download pre-trained model

# List models available in the model registry.
!ngc registry model list nvidia/tao/trafficcamnet:*
# Download the pretrained model from NGC
!ngc registry model download-version nvidia/tao/trafficcamnet:unpruned_v1.0 \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_trafficcamnet

!ls -rlt $LOCAL_EXPERIMENT_DIR/pretrained_trafficcamnet/trafficcamnet_vunpruned_v1.0

3. Change in training specification file.

I have made changes in the specification file for the following:

  1. I have 7 classes: car, bicycle, person, road_sign, auto_rickshaw, truck, bus
  2. size: 960 x 544
  3. format: PNG

I have made changes in dataset_config, augmentation_config, postprocessing_config, model_config, evaluation_config, cost_function_config, and bbox_rasterizer_config.
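For reference, the class-mapping part of my dataset_config looks roughly like this (the tfrecords path and image directory are illustrative, not my exact values):

```
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  image_extension: "png"
  target_class_mapping { key: "car"           value: "car" }
  target_class_mapping { key: "bicycle"       value: "bicycle" }
  target_class_mapping { key: "person"        value: "person" }
  target_class_mapping { key: "road_sign"     value: "road_sign" }
  target_class_mapping { key: "auto_rickshaw" value: "auto_rickshaw" }
  target_class_mapping { key: "truck"         value: "truck" }
  target_class_mapping { key: "bus"           value: "bus" }
  validation_fold: 0
}
```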

I am attaching my specification file for training along with detectnetv2 specification file in which i have made changes.

Please let me know if I have done anything wrong, or if there is anything else I have to change.

detectnet_v2_train_trafficcamnet.txt (8.6 KB)
detectnet_v2_train_resnet18_kitti.txt (5.4 KB)

Regards,
vikas

Please modify

  • The class names in “cost_function_config” are not correct. They should be the names of the 7 classes.
  • Please modify
pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_trafficcamnet/trafficcamnet_vunpruned_v1.0"

to

pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_trafficcamnet/trafficcamnet_vunpruned_v1.0/resnet18_trafficcamnet.tlt"

Hi,

Thanks for the reply.

I will make the changes and then go for training.

Regards,
vikas

Hi,

I am getting the following error. What do I have to do to remove this error?

Converting Tfrecords for kitti trainval dataset
Traceback (most recent call last):
  File "/home/spectross/.local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/spectross/.local/lib/python3.6/site-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/home/spectross/.local/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 258, in launch_command
    docker_logged_in(required_registry=self.task_map[task].docker_registry)
  File "/home/spectross/.local/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 129, in docker_logged_in
    data = load_config_file(docker_config)
  File "/home/spectross/.local/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 66, in load_config_file
    "No file found at: {}. Did you run docker login?".format(config_path)
AssertionError: Config path must be a valid unix path. No file found at: /home/spectross/.docker/config.json. Did you run docker login?

Regards,
vikas

Please run

docker login nvcr.io

Use $oauthtoken as the username and your NGC API key as the password. This creates the missing ~/.docker/config.json that the launcher is complaining about.

Hi,

Thanks for reply.

I have successfully trained the model; the log is below. I am getting an error when I am retraining the pruned model. Please go through the logs and the txt spec file for retraining: detectnet_v2_retrain_resnet18_kitti-def.txt (5.7 KB)

Please help me regarding the error.

4. Run TAO training 
Provide the sample spec file and the output directory location for models 
Note: The training may take hours to complete. Also, the remainder of the notebook assumes that the training was done in single-GPU mode. When run in multi-GPU mode, please expect to update the pruning and inference steps with new pruning thresholds and updated parameters in the clusterfile.json accordingly for optimum performance.
Detectnet_v2 now supports restart from checkpoint. In case the training job is killed prematurely, you may resume training from the closest checkpoint by simply re-running the same command line. Please do make sure to use the same number of GPUs when restarting the training.
When running the training with NUM_GPUS > 1, you may need to modify the batch_size_per_gpu and learning_rate to get a similar mAP as a 1-GPU training run. In most cases, scaling down the batch size by a factor of NUM_GPUS or scaling up the learning rate by a factor of NUM_GPUS would be a good place to start.
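As an illustration of that scaling note, for NUM_GPUS=2 the learning-rate block in the spec might be adjusted roughly like this (all values are hypothetical, not tuned baselines):

```
training_config {
  batch_size_per_gpu: 4
  learning_rate {
    soft_start_annealing_schedule {
      # Assumed 1-GPU baselines of 5e-06 / 5e-04, scaled up 2x for 2 GPUs.
      min_learning_rate: 1e-05
      max_learning_rate: 1e-03
      soft_start: 0.1
      annealing: 0.7
    }
  }
}
```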

!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_trafficcamnet.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector \
                        --gpus $NUM_GPUS

Log:
trian_log.txt (846.3 KB)

5. Evaluate the trained model 
!tao detectnet_v2 evaluate -e $SPECS_DIR/detectnet_v2_train_trafficcamnet.txt\
                           -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
                           -k $KEY

Log:

Validation cost: 0.000108
Mean average_precision (in %): 31.8826

class name       average precision (in %)
-------------  --------------------------
auto_rickshaw                      0
bicycle                           91.3963
bus                               46.8785
car                               84.9031
person                             0
road_sign                          0
truck                              0

Median Inference Time: 0.018872
2021-10-19 10:42:21,726 [INFO] __main__: Evaluation complete.
Time taken to run __main__:main: 0:00:28.154948.
2021-10-19 06:42:24,542 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
6. Prune the trained model
# Create an output directory if it doesn't exist.
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned
!tao detectnet_v2 prune \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt \
                  -eq union \
                  -pth 0.0000052 \
                  -k $KEY

Log:

2021-10-19 06:46:29,575 [INFO] root: Registry: ['nvcr.io']
2021-10-19 06:46:29,739 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/paperspace/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2021-10-19 10:46:42,283 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2021-10-19 10:46:43,349 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2021-10-19 10:47:12,937 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.7215261688691825
2021-10-19 06:47:15,644 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

!ls -rlt $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned/

total 32836
-rw-r--r-- 1 root root 33622248 Oct 19 06:47 resnet18_nopool_bn_detectnet_v2_pruned.tlt
7. Retrain the pruned model
# Retraining using the pruned model as pretrained weights 
!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti-def.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_retrain \
                        -k $KEY \
                        -n resnet18_detector_pruned \
                        --gpus $NUM_GPUS

Log:
retrain_log_.txt (26.5 KB)

Regards,
Vikas

Hi,

The above issue is solved; it was a problem in the retrain spec file. But as you can see in the training log, while training we only get results for three classes.

auto_rickshaw                      0
bicycle                           91.3963
bus                               46.8785
car                               84.9031
person                             0
road_sign                          0
truck                              0

What am I doing wrong? Please help me out.

Regards,
Vikas

Do these 4 classes (person, road_sign, truck, auto_rickshaw) have small objects?
If yes, please set smaller values for the parameters below.

  • minimum_height
  • minimum_width
  • minimum_bounding_box_height

For example,

  evaluation_box_config {
    key: "cyclist"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
    }
  }

  target_class_config {
    key: "cyclist"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.15000000596
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 4
      }
    }
  }