Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN

Please use one forum user’s dataset (TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory - #50 by sorata118) or generate tfrecords via the method mentioned in TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory - #41 by Morganh and then retry.

Mine is not a memory issue. So if I set image_size: "(128, 128)" # "(832, 1344)" to a smaller size, do I still need to resize the images and json files?

I found this paper.
Yes, the number of classes is 37, not 147.
Then, to increase accuracy, I need to filter out small-object classes by setting a threshold.
Their loss is also around 2.

I have a few queries
(1) Is it possible to train only the last few layers of ResNet50 in TAO for MaskRCNN? I just want to fine-tune the MaskRCNN heads; how can I do that?
(2) Which software can I use to view the dataset's annotations (json) and images?
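For a quick look without dedicated tooling, a stdlib-only Python summary works on any json that follows the standard COCO layout (images/annotations/categories). A minimal sketch with inline stand-in data (a real run would `json.load` the annotation file instead):

```python
import json

# Minimal COCO-style structure standing in for a real annotation file
# such as instances_shape_validation2020.json (hypothetical content).
doc = {
    "images": [{"id": 1, "file_name": "0001.jpg"}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 2, "bbox": [5, 5, 40, 30]}],
    "categories": [{"id": 2, "name": "car"}],
}

# Map category ids to names, then list each annotation with its label.
cats = {c["id"]: c["name"] for c in doc["categories"]}
for ann in doc["annotations"]:
    print(ann["image_id"], cats[ann["category_id"]], ann["bbox"])  # -> 1 car [5, 5, 40, 30]
```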

Refer to “freeze_blocks” in MaskRCNN - NVIDIA Docs

What do you mean by viewing the json file?

Below is an example. When training on the COCO dataset, the spec is inside this topic.

(1) How many blocks should I freeze via freeze_blocks if I want to train only the MaskRCNN heads with ResNet50?
My accuracy is still not good.

AP: 0.149292663
AP50: 0.231117696
AP75: 0.147759020
APl: 0.170746893
APm: 0.034814980
APs: 0.002098019
ARl: 0.381483465
ARm: 0.067172319
ARmax1: 0.236285821
ARmax10: 0.323538810
ARmax100: 0.327527165
ARs: 0.009126985
mask_AP: 0.111917272
mask_AP50: 0.183239624
mask_AP75: 0.114607982
mask_APl: 0.127348930
mask_APm: 0.007120144
mask_APs: 0.000000000
mask_ARl: 0.247205675
mask_ARm: 0.028806869
mask_ARmax1: 0.168297976
mask_ARmax10: 0.207456693
mask_ARmax100: 0.209397107
mask_ARs: 0.000000000

(2) My number of images is 18,000.
So the number of steps in the config file should be
total_steps = total_images * total_epochs / batch_size / nGPUs
total_steps = 18000 * 20 / 4 / 1 = 90,000
Is that correct?
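The arithmetic above can be sketched as a small helper, just to make the unit bookkeeping explicit (each step consumes batch_size images per GPU):

```python
def total_steps(total_images, total_epochs, batch_size, n_gpus):
    # Images processed per optimizer step across all GPUs is
    # batch_size * n_gpus, so divide the total image count by that.
    return total_images * total_epochs // (batch_size * n_gpus)

print(total_steps(18000, 20, 4, 1))  # -> 90000
```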

(3) When I train on 4 GPUs with the following command

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 4

I have error as
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE

The whole info during training with multiple Gpus is attached.
error.txt (163.8 KB)

  1. From Open Images Pre-trained Instance Segmentation — TAO Toolkit 3.22.05 documentation, resnet50.hdf5 is a pre-trained weight file trained on a subset of the Google OpenImages dataset.
    For the ResNet series, the block IDs valid for freezing are any subset of [0, 1, 2, 3] (inclusive).
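As a hedged sketch, the relevant portion of the train spec might look like the fragment below (field names per the MaskRCNN docs; the values are an example for freezing the whole backbone, not a recommendation):

```
maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1,2,3]"  # freeze all four ResNet blocks; only the heads train
    ...
}
```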

There is no baseline for the Mapillary dataset, so we cannot conclude whether the accuracy is good or not.

  1. Yes, total_steps = total_images * total_epochs / batch_size / nGPUs
  2. Please use a new result folder when you type the training command.

Yes, I always use a new folder for training. But I get that error when training with multiple GPUs.
The following errors happened when training with 4 GPUs.
NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0

adae3ba4da04:168:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:168:694 [0] NCCL INFO include/shm.h:41 -> 2

adae3ba4da04:168:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bac3e042d6e8f0a2-0-3-2 (size 9637888)
adae3ba4da04:168:694 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:753 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:916 -> 2

adae3ba4da04:166:702 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:166:702 [0] NCCL INFO include/shm.h:41 -> 2

adae3ba4da04:166:702 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0 (size 9637888)
adae3ba4da04:166:702 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:742 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:916 -> 2

For the above log, please refer to the below topic and its solution.

Yes, as he mentioned, I also changed to

"DockerOptions": {
    "shm_size": "16G",
    "ulimits": {
        "memlock": -1,
        "stack": 67108864
    }
}

Then it works.

Now I only need to solve the accuracy issue.
When I check the test images, I see segmentation: sky (not included in training) and road (included in training) are segmented, but cars have no bounding boxes.
There are also bounding boxes labelled N/A, which I don't understand.
Is there any misunderstanding on my part about the tested image shown attached?

How did you get the above annotated image? Can you share the command?

This is the command used.

!tao mask_rcnn inference -i $USER_EXPERIMENT_DIR/testimages \
                         -o $USER_EXPERIMENT_DIR/maskrcnn_annotated_images \
                         -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                         -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-90000.tlt \
                         -l $SPECS_DIR/instance_label.txt \
                         -t 0.5 \
                         -k $KEY \
                         --include_mask

Can you attach the full training log? More, how about running inference against some training images?

Thanks for the reply.
My latest training has a high AP50, comparable to the sample COCO training shown by TLT. AP50 = 0.33 is quite high.

AP: 0.213585272
AP50: 0.332958937
AP75: 0.210112900

Training log file is attached.
log.txt (6.7 MB)
The total loss reached 0.916, which is quite good.
My training spec file is also attached.
maskrcnn_train_resnet50.txt (2.1 KB)

The command for visualizing the test images is the same one shared before.

More test images are attached.




My objects include car, people, road, etc., as shown in instance_label.txt, but car and people are never detected.
instance_label.txt (173 Bytes)

What could be wrong?

Could you attach your /workspace/tao-experiments/data/mask_rcnn/instances_shape_validation2020.json ?

I'm afraid you need to set the correct num_classes: 125 during training.
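One plausible cause of the "N/A" boxes (an assumption on my part, not TAO's documented behavior): if num_classes in the spec exceeds the number of entries in instance_label.txt, predicted class indices past the end of the label list can only be displayed with a placeholder. A hypothetical illustration:

```python
# Hypothetical illustration, not TAO's actual code: a label file with
# fewer entries than num_classes leaves some class ids unnamed.
labels = ["background", "car", "person", "road"]  # 4 entries from a label file
num_classes = 6  # spec says 6 -> ids 4 and 5 have no label to show

def label_for(class_id, labels):
    # Fall back to a placeholder when the id is outside the label list.
    return labels[class_id] if 0 <= class_id < len(labels) else "N/A"

print(label_for(1, labels))  # -> car
print(label_for(5, labels))  # -> N/A
```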

But I am training for 19 classes only: 18 + 1 background.
My instances_shape_validation2020.json is attached.
instances_shape_validation2020.json (19.8 MB)

Let me retrain with 125 classes.
Here, the original post said 124 classes.

Then he also trained with 124 classes.

But in my python code for the Mapillary-to-COCO conversion, I have only 18 classes, as shown in main.py
main.py (9.6 KB)
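A quick consistency check, assuming the converter writes a standard COCO layout, is to compare the category count in the generated json against num_classes - 1 (background being the extra class). A minimal sketch with inline stand-in data (a real run would load the generated annotation file):

```python
import json

# Hypothetical stand-in for the converter's output json.
coco = json.loads('{"categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "person"}]}')

n_categories = len(coco["categories"])
print(n_categories)      # foreground classes present in the annotations
print(n_categories + 1)  # what num_classes in the train spec should be
```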

The original is 37 + 1 classes.

I changed to 125 classes. The training AP50 is quite high, and the loss is quite low during training.

AP: 0.194935068
AP50: 0.320598215
AP75: 0.191744179

The loss is 0.873.
But inference still gives the same result.
I don't understand: sky is not a class in the list, but sky is detected with high confidence.

The following are log file and spec file.
log.txt (6.7 MB)
maskrcnn_train_resnet50.txt (2.1 KB)

The test images are still the same.
I don't know what mistake was made.


I suspect I confused v1.2 and v2.0. I am using config_v1.2.json, but the dataset is v2.0, so the colors and classes don't match. Let me check first. Thanks
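One way to confirm the version mismatch is to diff the label names between the two config files. The dicts below are hypothetical minimal stand-ins for the real config_v1.2.json / config_v2.0.json (a real run would `json.load` both files, assuming they carry a "labels" list with "name" entries):

```python
# Hypothetical minimal configs illustrating a label-name diff.
v12 = {"labels": [{"name": "sky"}, {"name": "road"}, {"name": "car"}]}
v20 = {"labels": [{"name": "sky"}, {"name": "road"}, {"name": "vehicle--car"}]}

def names(cfg):
    return {label["name"] for label in cfg["labels"]}

# Symmetric difference: entries present in one version but not the other.
print(sorted(names(v12) ^ names(v20)))  # -> ['car', 'vehicle--car']
```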

More, from the log, the l2 loss does not decrease.
Please set a lower l2 weight decay and retry:
l2_weight_decay: 0.00001