Problems deploying a segmentation model trained with TLT 3.0

I have trained and deployed the TLT example-based model, and the deployment part didn't have any issues.

But I have some doubts about the training session.

I have annotated images containing cups for instance segmentation.
The images were annotated in Intel CVAT.
The dataset is single-class and has about 6 PNG images with about 310 annotations in total (roughly 50 per image).

I am training them in TLT with MaskRCNN on a ResNet-50 backbone,
with a batch size of 2 and about 50,000 total steps.

While deploying the model I cannot see any inference output visually. Is this because of the small number of images?

I have some general questions on MaskRCNN:

1) What is the minimum number of images recommended for training an instance segmentation model (single class)?
2) If the model isn't trained well, will it show any false segmentations on deployment?
3) I gave the TFRecord generated from CVAT directly to TLT for training. Does that create any problems?

PNG files are not supported for MaskRCNN training.

See https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/open_model_architectures.html#instance-segmentation

MaskRCNN

  • Input size: C * W * H (where C = 3, W >= 128, H >= 128, and W, H are multiples of 32)
  • Image format: JPG
  • Label format: COCO detection
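For anyone converting an existing PNG dataset to meet the requirements above, here is a minimal sketch using Pillow; the directory names are placeholders. Note that resizing invalidates any existing pixel-coordinate annotations, so convert before annotating, or rescale the annotations accordingly.

```python
# A minimal sketch (assumes Pillow is installed; directory names are
# placeholders) for converting PNGs to JPG and snapping dimensions to
# multiples of 32, per the requirements above.
from pathlib import Path
from PIL import Image

SRC = Path("dataset/png_images")   # hypothetical input directory
DST = Path("dataset/jpg_images")   # hypothetical output directory
DST.mkdir(parents=True, exist_ok=True)

for png in SRC.glob("*.png"):
    img = Image.open(png).convert("RGB")        # drop alpha channel for JPG
    w, h = img.size
    # Round each dimension down to the nearest multiple of 32 (minimum 128).
    new_w = max(128, (w // 32) * 32)
    new_h = max(128, (h // 32) * 32)
    if (new_w, new_h) != (w, h):
        img = img.resize((new_w, new_h))
    img.save(DST / (png.stem + ".jpg"), quality=95)
```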

Thanks for the information.
This time I used the dataset in JPG format for training.
The image resolutions are multiples of 32.

Problems faced:

  1. When the evaluation starts, I get the following error.

[MaskRCNN] ERROR: Job finished with an uncaught exception: FAILURE

Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

  2. I tried to finish the training by removing the evaluation steps; the training then completed successfully, but I get no output from the model on deployment.

Can you share the full command you run for evaluation?
This traceback is likely caused by an incorrect environment path in your command line.

I'll rephrase my problems.

Hardware spec:
Intel Core i9
NVIDIA RTX 2080 Ti

Note: TLT and DeepStream are installed inside Docker on this PC.

I have trained the default TLT MaskRCNN model using the notebook provided with TLT.
That model was then exported and worked as expected.

Now I am trying to build a custom instance segmentation model using my own dataset.

To start, I chose a minimal number of JPG images in the dataset for both training and testing.
I then annotated them using Intel CVAT under a single class (plant).
In CVAT I used "Export as dataset TFRecord" (for the .tfrecord) and "Export annotations as COCO" (for the .json).

I then changed the locations of the tfrecord and json files in the spec to point to my files.
(Note: only the locations of these files were changed; all other parameters were left untouched.)

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.02
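For orientation, here is a hedged reading of how the schedule above behaves, assuming the decay levels are multipliers applied to init_learning_rate at each boundary in learning_rate_steps, after a linear warmup (check the TLT docs for the exact semantics):

```python
# Hedged sketch of the schedule implied by the spec above, assuming
# learning_rate_decay_levels are multipliers on init_learning_rate applied
# at each boundary in learning_rate_steps (verify against the TLT docs).
def lr_at(step,
          init_lr=0.02, warmup_lr=0.0001, warmup_steps=1000,
          boundaries=(10000, 15000, 20000),
          decay_levels=(0.1, 0.02, 0.01)):
    if step < warmup_steps:
        # Linear warmup from warmup_learning_rate up to init_learning_rate.
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    lr = init_lr
    for boundary, level in zip(boundaries, decay_levels):
        if step >= boundary:
            lr = init_lr * level
    return lr

print(lr_at(500), lr_at(12000), lr_at(24000))   # ~0.01, 0.002, 0.0002
```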

After completing this configuration I started the training, which didn't throw any errors at the start.
When the evaluation started, it threw the following error:

[MaskRCNN] ERROR: Job finished with an uncaught exception: FAILURE
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

I can resume the training past these evaluation steps by re-running the same training cell after the error,
but I cannot find the cause of the evaluation issue.

I also cannot run inference without errors.

I have the .etlt model file that was generated during the process.

After creating the model I used the DeepStream installation inside Docker and built the OSS plugins inside the container.

I followed the deployment strategy from https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/
and deployed the model with deepstream-app using a video input.
The engine was created and the model loaded,
but the output window didn't show any segmentations.

Summary of problems:

  1. Training a custom MaskRCNN model in TLT produces errors during evaluation and inference.
  2. The model converts into an engine without errors, but no segmentation appears in the output.

I have been trying to solve these problems for the past 2 weeks and cannot find the cause.

I have attached the log files and dataset files used here … please try to replicate the problem I'm facing.

Log Files:

https://drive.google.com/drive/folders/1ZlRnhJIeCnOuMuZUvHjZ6FrtkWTeu6aD?usp=sharing

I cannot access your log files.
Also, it would be better for you to attach the .ipynb file.

Well received. It seems that the training is abnormal:
the FastRCNN box loss is always 0.

So, is there any way to solve this problem?

Please check your own dataset first. Verify that the labels and masks are correct.
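A zero box loss often points at degenerate or misaligned annotations. As one way to do this check, here is a minimal sketch that overlays every mask and box on its image with pycocotools and matplotlib; the annotation file name and image directory are placeholders for your own dataset.

```python
# Minimal visual check of a COCO annotation file: overlay each image's
# masks and boxes, and flag degenerate boxes. Assumes pycocotools and
# matplotlib are installed; the paths are placeholders.
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from pycocotools.coco import COCO

ANN_FILE = "instances_default.json"   # hypothetical annotation file
IMAGE_DIR = Path("images")            # hypothetical image directory

coco = COCO(ANN_FILE)
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    fig, ax = plt.subplots()
    ax.imshow(Image.open(IMAGE_DIR / info["file_name"]))
    coco.showAnns(anns)               # overlays the segmentation polygons
    for ann in anns:
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:          # zero-area boxes can zero the box loss
            print(f"Degenerate bbox in image {img_id}: {ann['bbox']}")
        ax.add_patch(patches.Rectangle((x, y), w, h, linewidth=1,
                                       edgecolor="r", facecolor="none"))
    plt.show()
```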

I am using the COCO 2017 dataset (validation archive) as my default dataset because of its smaller size.
I converted the annotations (instances.json & captions.json) to TFRecords using TLT.
Using the generated TFRecords, I was able to train a MaskRCNN model successfully.

I am now trying to train a model on a custom single-class dataset.
The dataset was annotated in CVAT (with both bounding boxes and polygon annotations for segmentation) and exported as COCO, which produced only an instances.json.

While trying to convert it to TFRecords I got this error:

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /tmp/train
I0520 12:33:14.203982 139778702948160 create_coco_tf_record.py:266] writing to output path: /tmp/train
INFO:tensorflow:Building bounding box index.
I0520 12:33:14.236848 139778702948160 create_coco_tf_record.py:212] Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
I0520 12:33:14.237035 139778702948160 create_coco_tf_record.py:223] 0 images are missing bboxes.
Traceback (most recent call last):
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 333, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 321, in main
    num_shards=256)
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 275, in _create_tf_record_from_coco_annotations
    _load_caption_annotations(caption_annotations_file))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 230, in _load_caption_annotations
    caption_annotations = json.load(fid)
  File "/usr/lib/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/lib/io/file_io.py", line 122, in read
    self._preread_check()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/lib/io/file_io.py", line 84, in _preread_check
    compat.as_bytes(self.__name), 1024 * 512)
tensorflow.python.framework.errors_impl.NotFoundError: ; No such file or directory

Whereas with an image from the same custom dataset, annotated with LabelMe:

python /workspace/examples/maskrcnn/create_coco_tf_record.py --logtostderr --include_masks --train_image_dir=/workspace/tlt-experiments/data/image/002-1-1.jpg --val_image_dir=/workspace/tlt-experiments/data/image/002-1-1.jpg --test_image_dir=/workspace/tlt-experiments/data/image/002-1-1.jpg --train_object_annotations_file=/workspace/tlt-experiments/data/image/002-1-1.json --val_object_annotations_file=/workspace/tlt-experiments/data/image/002-1-1.json
2021-05-20 14:07:37.548464: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /tmp/train
I0520 14:07:38.443266 140494452983616 create_coco_tf_record.py:266] writing to output path: /tmp/train
Traceback (most recent call last):
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 333, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 321, in main
    num_shards=256)
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 273, in _create_tf_record_from_coco_annotations
    _load_object_annotations(object_annotations_file))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 207, in _load_object_annotations
    images = obj_annotations['images']
KeyError: 'images'

This time I didn't get any error about "caption" files.

I don't know how to proceed from here. Is there any tool that annotates with image captions?
Also, as a general doubt, what is the use of image captioning here?

My questions:

  1. Is it necessary to have image captions when training an instance segmentation model?
  2. Is there a preferred annotation tool?
  3. I get the "0 images are missing bboxes" message when using the CVAT-generated annotation file.

How can I proceed from this point?
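The KeyError: 'images' above is a strong hint that the LabelMe export is not COCO format at all: LabelMe's native per-image JSON has a different layout (e.g. a top-level 'shapes' key) and needs converting to COCO first. A quick format sanity check, sketched with a placeholder path:

```python
# Quick check that an annotation file is actually COCO-style before feeding
# it to create_coco_tf_record.py. A sketch; the path is a placeholder.
import json

REQUIRED_KEYS = {"images", "annotations", "categories"}

with open("annotations.json") as f:    # hypothetical file
    data = json.load(f)

missing = REQUIRED_KEYS - set(data)
if missing:
    print(f"Not a COCO detection file; missing top-level keys: {missing}")
else:
    print(f"{len(data['images'])} images, "
          f"{len(data['annotations'])} annotations, "
          f"categories: {[c['name'] for c in data['categories']]}")
```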

You can add a dummy caption file when you run create_coco_tf_record.py.
For example, add the following:
--train_caption_annotations_file=./captions_val2017.json
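If COCO's captions_val2017.json is not at hand, a dummy captions file can be generated from the instances file instead. A sketch, with the field layout inferred from the KeyError: 'caption' traceback above (each caption annotation needs at least an 'image_id' and a 'caption'); the file names are placeholders.

```python
# Build a dummy captions JSON from the instances file, so the converter's
# --*_caption_annotations_file arguments have something valid to read.
# A sketch; the layout is inferred from the KeyError: 'caption' above,
# and the file names are placeholders.
import json

with open("instances_default.json") as f:          # hypothetical input
    instances = json.load(f)

captions = {
    "info": instances.get("info", {}),
    "licenses": instances.get("licenses", []),
    "images": instances["images"],
    "annotations": [
        {"id": i, "image_id": img["id"], "caption": ""}
        for i, img in enumerate(instances["images"], start=1)
    ],
}

with open("captions_dummy.json", "w") as f:        # hypothetical output
    json.dump(captions, f)
```

The resulting captions_dummy.json would then be passed via --train_caption_annotations_file and --val_caption_annotations_file.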

Thanks.

Also,

  1. Is there an annotation tool you would recommend to ease the annotation experience with custom datasets?

  2. Why do I get the "0 images are missing bboxes" message?

  • Some users use CVAT.
  • It is not an error, just an informational log line.

Hello,

I am still facing the problem of getting my own annotated data to train,
so I'll fill in the details needed to replicate the problem here.

Version Info:

Dataset:
Image count: 5
Image type: jpg
Image resolution: 800x565
No. of labels (used): 1
Label name: shape

The dataset has been annotated using CVAT (latest release from GitHub)
and then exported in COCO format, containing an
instances_default.json (10.7 KB)
file.

The file exported from CVAT is given as the annotation file for both training and validation, and the same images are used for both training and testing.
(Note: this training/testing/validation setup is only an attempt to get one successful training run.)

The download_and_preprocess_coco.sh file was edited to use the custom dataset present locally.

download_and_preprocess_coco.sh (2.6 KB)

TLT version pulled:

nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Configs for training:
coco_labels.txt (6 Bytes)
maskrcnn_train_resnet50.txt (2.0 KB)

The TFRecord generation process always ends with this output.

Log:
+ '[' -z /workspace/tlt-experiments/data ']'
+ echo 'Cloning Tensorflow models directory (for conversion utilities)'
Cloning Tensorflow models directory (for conversion utilities)
+ '[' '!' -e tf-models ']'
+ git clone https://github.com/tensorflow/models tf-models
Cloning into 'tf-models'...
warning: redirecting to https://github.com/tensorflow/models.git/
remote: Enumerating objects: 57199, done.
remote: Counting objects: 100% (1287/1287), done.
remote: Compressing objects: 100% (489/489), done.
remote: Total 57199 (delta 890), reused 1156 (delta 782), pack-reused 55912
Receiving objects: 100% (57199/57199), 572.84 MiB | 3.02 MiB/s, done.
Resolving deltas: 100% (39538/39538), done.
+ cd tf-models/research
+ protoc object_detection/protos/anchor_generator.proto object_detection/protos/argmax_matcher.proto object_detection/protos/bipartite_matcher.proto object_detection/protos/box_coder.proto object_detection/protos/box_predictor.proto object_detection/protos/calibration.proto object_detection/protos/center_net.proto object_detection/protos/eval.proto object_detection/protos/faster_rcnn.proto object_detection/protos/faster_rcnn_box_coder.proto object_detection/protos/flexible_grid_anchor_generator.proto object_detection/protos/fpn.proto object_detection/protos/graph_rewriter.proto object_detection/protos/grid_anchor_generator.proto object_detection/protos/hyperparams.proto object_detection/protos/image_resizer.proto object_detection/protos/input_reader.proto object_detection/protos/keypoint_box_coder.proto object_detection/protos/losses.proto object_detection/protos/matcher.proto object_detection/protos/mean_stddev_box_coder.proto object_detection/protos/model.proto object_detection/protos/multiscale_anchor_generator.proto object_detection/protos/optimizer.proto object_detection/protos/pipeline.proto object_detection/protos/post_processing.proto object_detection/protos/preprocessor.proto object_detection/protos/region_similarity_calculator.proto object_detection/protos/square_box_coder.proto object_detection/protos/ssd.proto object_detection/protos/ssd_anchor_generator.proto object_detection/protos/string_int_label_map.proto object_detection/protos/target_assigner.proto object_detection/protos/train.proto --python_out=.
+ touch tf-models/__init__.py
+ touch tf-models/research/__init__.py
+++ readlink -f download_and_preprocess_coco.sh
++ dirname /workspace/examples/maskrcnn/download_and_preprocess_coco.sh
+ SCRIPT_DIR=/workspace/examples/maskrcnn
+ PYTHONPATH=tf-models:tf-models/research
+ python /workspace/examples/maskrcnn/create_coco_tf_record.py --logtostderr --include_masks --train_image_dir=/workspace/tlt-experiments/data/images --val_image_dir=/workspace/tlt-experiments/data/images --test_image_dir=/workspace/tlt-experiments/data/images --train_object_annotations_file=/workspace/tlt-experiments/data/instances_default.json --val_object_annotations_file=/workspace/tlt-experiments/data/instances_default.json --train_caption_annotations_file=/workspace/tlt-experiments/data/instances_default.json --val_caption_annotations_file=/workspace/tlt-experiments/data/instances_default.json --testdev_annotations_file=/workspace/tlt-experiments/data/instances_default.json --output_dir=/workspace/tlt-experiments/data/
2021-05-24 14:15:06.992796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tlt-experiments/data/train
I0524 14:15:07.904711 140655449888576 create_coco_tf_record.py:266] writing to output path: /workspace/tlt-experiments/data/train
INFO:tensorflow:Building bounding box index.
I0524 14:15:07.946977 140655449888576 create_coco_tf_record.py:212] Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
I0524 14:15:07.947106 140655449888576 create_coco_tf_record.py:223] 0 images are missing bboxes.
INFO:tensorflow:Building caption index.
I0524 14:15:07.947659 140655449888576 create_coco_tf_record.py:233] Building caption index.
INFO:tensorflow:0 images are missing captions.
I0524 14:15:07.947719 140655449888576 create_coco_tf_record.py:245] 0 images are missing captions.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 200, in _pool_create_tf_example
    return create_tf_example(*args)
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 156, in create_tf_example
    captions.append(caption_annotation['caption'].encode('utf8'))
KeyError: 'caption'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 333, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 321, in main
    num_shards=256)
  File "/workspace/examples/maskrcnn/create_coco_tf_record.py", line 287, in _create_tf_record_from_coco_annotations
    for image in images])):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
KeyError: 'caption'

Note:
I'm stuck here for the time being and can't move any further.
I am sure you will be able to replicate the problem I'm facing with the files provided.

Exported data from CVAT:

The images are also provided inside the archive.
Shapes-COCO.zip (174.0 KB)

There is no "caption" field in instances_default.json. You can just use COCO's captions_val2017.json in its place as the caption annotations file.

Thanks for the support and help.

The training completed successfully without any errors.

I have some doubts about the inference part:

  1. The inferred image shows the label as "N/A"; could you point out what's wrong?

  2. Also, is there any way to identify the id of each segmented object?

Which inference method did you use? Is it "tlt mask_rcnn inference"?
BTW, did you ever run the default Jupyter notebook to train and run inference against the default COCO dataset? Was it successful?

Yes, I used "tlt mask_rcnn inference".
Yes, the default training and inference using the default COCO dataset were successful.

If you run the COCO dataset successfully, it will show the labels instead of N/A. You can dig into what the difference is.
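One concrete difference worth checking is the class-label file passed to inference (the coco_labels.txt attached above, which presumably holds the single class): if its entries don't line up with the category ids used at training time, the visualizer can fall back to N/A. A sketch for comparing the two, with placeholder paths; the exact id-to-line mapping expected by tlt mask_rcnn inference should be confirmed against the TLT documentation.

```python
# Compare the categories in the training annotations against the label
# file passed to inference. A sketch with placeholder paths; the exact
# id-to-line mapping expected by `tlt mask_rcnn inference` should be
# confirmed against the TLT documentation.
import json

with open("instances_default.json") as f:      # hypothetical annotations
    categories = json.load(f)["categories"]

with open("coco_labels.txt") as f:             # label file from the configs
    labels = [line.strip() for line in f if line.strip()]

print("categories in annotations:", {c["id"]: c["name"] for c in categories})
print("labels file entries:", labels)
```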