Exception on MaskRCNN with a different input size in tlt-export

Hi,

I trained MaskRCNN with the default notebook at the default input size of 832x1344, and everything worked fine.
Then I changed only the “image_size” param to 320x448 (both dimensions multiples of 64, as guided) and retrained; that also looked OK.
But when I issue the tlt-export command:

tlt-export mask_rcnn -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-58500.tlt -k $KEY -e $SPECS_DIR/maskrcnn_train_resnet50.txt --batch_size 1 --data_type int8 --cal_image_dir $DATA_DOWNLOAD_DIR/raw-data/val2017 --batches 10 --cal_cache_file $USER_EXPERIMENT_DIR/export/maskrcnn.cal --cal_data_file $USER_EXPERIMENT_DIR/export/maskrcnn.tensorfile

I get:
… output that looks ok …

Marking ['generate_detections', 'mask_head/mask_fcn_logits/BiasAdd'] as outputs
2020-12-09 11:04:21,971 [INFO] iva.mask_rcnn.export.exporter: Converted model was saved into /workspace/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-58500.etlt
8it [00:00, 16.23it/s]
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 185, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 263, in run_export
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/export/exporter.py", line 528, in export
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/base_exporter.py", line 206, in get_calibrator
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/base_exporter.py", line 317, in generate_tensor_file
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/base_exporter.py", line 380, in prepare_chunk
ValueError: operands could not be broadcast together with shapes (320,448) (3,) (320,448) 

Only the maskrcnn.tensorfile gets created; the .engine and .cal files are not created.
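(For context, this broadcast failure is the kind of error numpy raises when a 3-element per-channel value, such as a channel mean, is applied to an image array that has no channel axis, e.g. a single-channel calibration image. A minimal illustration, not the TLT code itself, with placeholder mean values:

    import numpy as np

    img = np.zeros((320, 448))                   # image loaded without a channel axis: shape (H, W)
    mean = np.array([103.939, 116.779, 123.68])  # per-channel mean, shape (3,)
    np.subtract(img, mean, out=img)
    # ValueError: operands could not be broadcast together with shapes (320,448) (3,) (320,448)

Note the three shapes in the message match the traceback above exactly.)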

Any ideas?

Thanks for the help!

Can you share your $SPECS_DIR/maskrcnn_train_resnet50.txt?

Hi,

Attached: maskrcnn_train_resnet50.txt (2.0 KB)

See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/creating_experiment_spec.html#data-config:

“Image dimension as a tuple within quote marks. “(height, width)” indicates the dimension of the resized and padded input.”

Since the tuple is (height, width), the default notebook trains a model with height 832 and width 1344, i.e. a 1344x832 model in width-by-height terms rather than 832x1344. For the same reason, your spec trains a 448x320 (width x height) model.
Please check whether that matches your requirement.
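For reference, under the (height, width) convention the two specs read as follows (a sketch, following the data-config docs above; other fields omitted):

    data_config {
      image_size: "(832, 1344)"   # default notebook: 832 high, 1344 wide
    }

    data_config {
      image_size: "(320, 448)"    # your spec: 320 high, 448 wide
    }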

Hi,

Sorry, but I’m not following. My goal is to train and infer on images with width 448 and height 320.
I was under the impression that I just need to change the “image_size” param, then train and export.

Is this the case ?

If not, what else should I change? In particular:
– Do I have to change the actual input images for training?
– Can I use the checkpoint model (resnet50.hdf5) as a starting point?

Just to clarify: I trained with my modified config file (with the 320x448 image_size), and I looked at the script that creates the TFRecords and found no reference to image size.

Thanks for the help!

  1. No, you do not need to resize your dataset; the input images are resized and padded to image_size during training.
  2. Yes, you can use the pretrained model as a starting point.

To narrow this down, I suggest:
a) reading the MaskRCNN blog post: https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/
b) running the default Jupyter notebook, but with image_size: “(320, 448)”, total_steps set to about 1000, and learning_rate_steps: “[100, 200, 600]” (a sketch of these spec changes follows below),

to see whether the same problem appears.
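For reference, a sketch of the suggested spec changes (field placement assumed to match the default notebook spec; everything else left at the defaults):

    data_config {
      image_size: "(320, 448)"    # (height, width): 320 high, 448 wide
      # ... other data_config fields unchanged ...
    }
    total_steps: 1000             # short run, just to check whether the problem reproduces
    learning_rate_steps: "[100, 200, 600]"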

Hi,

Thanks, it was a glitch on my side.

But now I have another question:
I trained for 100K iterations at the smaller resolution, and the results are really poor. Is there a reference spec file, with parameter values including the number of iterations, that makes the model give reasonable results?

Even a sample reference spec for the 832x1344 resolution that gives reasonable results would be highly appreciated.

(By “reasonable results” I mean, for example, bounding boxes of roughly the same accuracy as the ResNet18 sample models on a given image for common objects (car, person, etc.), and reasonable masks for those bounding boxes.)

Thanks for the help!

For the AP results, please refer to the spec in Poor metric results after retraining maskrcnn using TLT notebook - #16 by ghazni; it gives reasonable results.

Thanks.