Failed to validate a retrained UNet TensorRT engine

I am trying to retrain UNet on the KITTI dataset by modifying the UNet sample Jupyter notebook.

The rough procedure is below; it failed at step 7 (validation).

  1. training → OK

  2. validation → OK

  3. pruning → OK

  4. retraining → OK

  5. validation → OK

  6. convert the .tlt model to a TensorRT engine → OK

  7. validation → Failed

    !tao unet evaluate --gpu_index=$GPU_INDEX -e $SPECS_DIR/unet_train_resnet_unet_kitti_retrain.txt \
                     -m $USER_EXPERIMENT_DIR/export/trt.fp32.kitti.retrained.engine \
                     -o $USER_EXPERIMENT_DIR/kitti_experiment_retrain/ \
                     -k $KEY \
                     --verbose
    
  8. inference → OK?

    • The inference result using the TensorRT engine generated in step 6 seems to be working well.

The attached files are:

  • The spec file (unet_train_resnet_unet_kitti_retrain.txt)
  • The error log (validation_error.log)

The output of the tao info command is below.

dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

I found that changing the following in the spec file makes it work, but I would like to keep color mode:
  • model_config {model_input_channels: 3 → 1}
  • dataset_config {input_image_type: "color" → "grayscale"}

Please let me know if you have any information about this error.
Any information would be appreciated.

Thanks.
validation_error.log (7.7 KB)
unet_train_resnet_unet_kitti_retrain.txt (4.3 KB)

Could you share the full command and log for steps 6 and 8?

Thanks for your response.

Step 6:

!tao unet export --gpu_index=$GPU_INDEX -m $USER_EXPERIMENT_DIR/kitti_experiment_retrain/weights/model_kitti_retrained.tlt \
               -k $KEY \
               -e $SPECS_DIR/unet_train_resnet_unet_kitti_retrain.txt \
               --data_type fp32 \
               --engine_file $USER_EXPERIMENT_DIR/export/trt.fp32.kitti.retrained.engine \
               --max_batch_size 3 \
               --gen_ds_config \
               --verbose

Step 8:

!tao unet inference --gpu_index=$GPU_INDEX -e $SPECS_DIR/unet_train_resnet_unet_kitti_retrain.txt \
                  -m $USER_EXPERIMENT_DIR/export/trt.fp32.kitti.retrained.engine \
                  -o $USER_EXPERIMENT_DIR/kitti_experiment_retrain/ \
                  -k $KEY

Could you please share the logs as well?

The logs are attached.

step6_export.log (24.1 KB)
step8_inference.log (5.8 KB)

Could you try another experiment? Run evaluation with the .tlt model instead of the TensorRT engine.

!tao unet evaluate --gpu_index=$GPU_INDEX -e $SPECS_DIR/unet_train_resnet_unet_kitti_retrain.txt \
                 -m your_tlt_model \
                 -o $USER_EXPERIMENT_DIR/kitti_experiment_retrain/ \
                 -k $KEY \
                 --verbose

It seems that you already ran the above in step 5. Please share its log with me. Thanks.

The log of the following command is attached.
evaluate_retraind_model_tlt.log (47.7 KB)

!tao unet evaluate --gpu_index=$GPU_INDEX -e $SPECS_DIR/unet_train_resnet_unet_kitti_retrain.txt \
                 -m $USER_EXPERIMENT_DIR/kitti_experiment_retrain/weights/model_kitti_retrained.tlt \
                 -o $USER_EXPERIMENT_DIR/kitti_experiment_retrain/ \
                 -k $KEY \
                 --verbose

Step 8 and step 6 are using the same TensorRT engine, but they are running against different images.
So, to narrow down, please run several experiments.

Exp1:
Could you run step 8 again? Before running, please change the following in your spec file.

From:
test_images_path:"/workspace/tao-experiments/data/images/test"

To:
test_images_path:"/workspace/tao-experiments/data/images/val"

Exp2:
Please run step 6 again. Before running, please change the following in your spec file.

From:
val_images_path:"/workspace/tao-experiments/data/images/val"
val_masks_path:"/workspace/tao-experiments/data/masks/val"

To:
val_images_path:"/workspace/tao-experiments/data/images/train"
val_masks_path:"/workspace/tao-experiments/data/masks/train"

Exp3:
Please set --max_batch_size 1 when running "tao unet export xxx".
Then run evaluation again.

The logs are attached. The last line of the error changed as below.

  ValueError: operands could not be broadcast together with shapes (34,34) **(12,12)** (34,34)
  -> ValueError: operands could not be broadcast together with shapes (34,34) **(15,15)** (34,34)
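
For what it's worth, those three shapes are what numpy prints for a failed in-place add (the two inputs followed by the output), so it looks as though a smaller per-image matrix is being added into a fixed 34x34 accumulator. Here is a minimal numpy illustration of that failure mode (only a guess at what evaluate is doing internally, not the actual TAO code):

import numpy as np

# Guess at the failure mode, not the actual TAO evaluate code:
# an accumulator sized for the full label set ...
total = np.zeros((34, 34), dtype=np.int64)
# ... plus a per-image matrix built only over the labels present in that image.
per_image = np.zeros((12, 12), dtype=np.int64)

total += per_image
# ValueError: operands could not be broadcast together with
# shapes (34,34) (12,12) (34,34)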

Thanks.

Could you check the resolution of all the images?
$ apt-get install file
$ file /workspace/tao-experiments/data/images/train/*
$ file /workspace/tao-experiments/data/images/val/*
$ file /workspace/tao-experiments/data/images/test/*
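
Alternatively, here is a minimal Python sketch (assuming Pillow is available in the environment and the folders contain only images; the paths are the ones above) that prints each distinct resolution and how many images have it:

from collections import Counter
from pathlib import Path
from PIL import Image

# Count how many images share each resolution in every split.
for split in ("train", "val", "test"):
    folder = Path(f"/workspace/tao-experiments/data/images/{split}")
    sizes = Counter(Image.open(p).size for p in sorted(folder.iterdir()) if p.is_file())
    print(split)
    for (w, h), n in sizes.most_common():
        print(f"  {w}x{h}: {n} images")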

The logs are attached.
file_test.log (16.0 KB)
file_train.log (13.0 KB)
file_val.log (3.2 KB)

The image sizes are slightly different from one another.

Thanks.

Sorry for the late reply. Could you use a smaller val dataset and retry? Please select only images with the same resolution. Thanks.
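
If it helps, here is a small sketch for building such a reduced val set: it keeps only the val images whose resolution is the most common one. It assumes Pillow is available, that mask files share the image file names, and that the val_same_size output folders are hypothetical names you would then point the spec file at:

import shutil
from collections import Counter
from pathlib import Path
from PIL import Image

images = Path("/workspace/tao-experiments/data/images/val")
masks = Path("/workspace/tao-experiments/data/masks/val")
# Hypothetical destination folders for the reduced validation set.
images_out = Path("/workspace/tao-experiments/data/images/val_same_size")
masks_out = Path("/workspace/tao-experiments/data/masks/val_same_size")
images_out.mkdir(parents=True, exist_ok=True)
masks_out.mkdir(parents=True, exist_ok=True)

files = sorted(p for p in images.iterdir() if p.is_file())
# Keep only the images that have the most common resolution.
target = Counter(Image.open(p).size for p in files).most_common(1)[0][0]
for p in files:
    if Image.open(p).size == target and (masks / p.name).exists():
        shutil.copy2(p, images_out / p.name)
        shutil.copy2(masks / p.name, masks_out / p.name)
print(f"kept resolution {target[0]}x{target[1]}")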

With 5 images of the same resolution, the result differed only in the last line, as below.

  ValueError: operands could not be broadcast together with shapes (34,34) **(12,12)** (34,34) 
 -> ValueError: operands could not be broadcast together with shapes (34,34) **(17,17)** (34,34) 

The full log is attached.
evalate_with_same_size.log (7.5 KB)