Refer to “freeze_blocks” in MaskRCNN — TAO Toolkit 3.22.05 documentation
What do you mean by viewing the json file?
Below is an example. When training on the COCO dataset, the spec is inside this topic.
(1) How many blocks should I freeze in freeze_blocks if I want to train only the Mask R-CNN heads with a ResNet50 backbone?
My accuracy is still not good.
AP: 0.149292663
AP50: 0.231117696
AP75: 0.147759020
APl: 0.170746893
APm: 0.034814980
APs: 0.002098019
ARl: 0.381483465
ARm: 0.067172319
ARmax1: 0.236285821
ARmax10: 0.323538810
ARmax100: 0.327527165
ARs: 0.009126985
mask_AP: 0.111917272
mask_AP50: 0.183239624
mask_AP75: 0.114607982
mask_APl: 0.127348930
mask_APm: 0.007120144
mask_APs: 0.000000000
mask_ARl: 0.247205675
mask_ARm: 0.028806869
mask_ARmax1: 0.168297976
mask_ARmax10: 0.207456693
mask_ARmax100: 0.209397107
mask_ARs: 0.000000000
(2) I have 18,000 images.
So the number of steps in the config file should be:
total_steps = total_images * total_epochs / batch_size / nGPUs
total_steps = 18000 * 20 / 4 / 1 = 90,000
Is it correct?
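As a quick sanity check of that arithmetic, a minimal sketch (assuming batch_size is the per-GPU batch size):

# Sanity check of the step arithmetic, using the values from this thread.
total_images = 18000
total_epochs = 20
batch_size = 4   # per GPU (assumption)
num_gpus = 1

total_steps = total_images * total_epochs // (batch_size * num_gpus)
print(total_steps)  # 90000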
(3) When I train on 4 GPUs with the following command:
!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt
-d $USER_EXPERIMENT_DIR/experiment_dir_unpruned
-k $KEY
--gpus 4
I get the following error:
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`
The full log from training with multiple GPUs is attached.
error.txt (163.8 KB)
There is no baseline for the Mapillary dataset, so we cannot conclude whether the accuracy is good or not.
Yes, I always use a new folder for training. But I get that error when training with multiple GPUs.
The following errors occurred when training with 4 GPUs:
NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0
adae3ba4da04:168:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:168:694 [0] NCCL INFO include/shm.h:41 -> 2
adae3ba4da04:168:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bac3e042d6e8f0a2-0-3-2 (size 9637888)
adae3ba4da04:168:694 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:753 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:916 -> 2
adae3ba4da04:166:702 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:166:702 [0] NCCL INFO include/shm.h:41 -> 2
adae3ba4da04:166:702 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0 (size 9637888)
adae3ba4da04:166:702 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:742 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:916 -> 2
For the above log, please refer to the topic below and its solution.
Yes, as he mentioned, I also changed it to:
"DockerOptions": {
    "shm_size": "16G",
    "ulimits": {
        "memlock": -1,
        "stack": 67108864
    }
}
Then it works.
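For reference, a minimal sketch of writing that DockerOptions block into the launcher's mounts file (assuming the launcher reads it from ~/.tao_mounts.json):

import json
import os

# Add the shared-memory and ulimit settings to the TAO launcher mounts file.
mounts_path = os.path.expanduser("~/.tao_mounts.json")
with open(mounts_path) as f:
    cfg = json.load(f)

cfg["DockerOptions"] = {
    "shm_size": "16G",
    "ulimits": {"memlock": -1, "stack": 67108864},
}

with open(mounts_path, "w") as f:
    json.dump(cfg, f, indent=4)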
Now only the accuracy issue remains to be solved.
When I check the test images, I see segmentation. Sky (not included in training) and road (included in training) are segmented, but cars have no bounding box.
There are also bounding boxes labelled N/A, which I don't understand.
Is there some misunderstanding on my part about the tested image shown in the attachment?
How did you get the above annotated image? Can you share the command?
This is the command I used:
!tao mask_rcnn inference -i $USER_EXPERIMENT_DIR/testimages \
-o $USER_EXPERIMENT_DIR/maskrcnn_annotated_images \
-e $SPECS_DIR/maskrcnn_train_resnet50.txt \
-m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-90000.tlt \
-l $SPECS_DIR/instance_label.txt \
-t 0.5 \
-k $KEY \
--include_mask
Can you attach the full training log? Also, how about running inference against some of the training images?
Thanks for the reply.
My latest training reaches an AP50 comparable to the sample TLT training on the COCO dataset. AP50 = 0.33 is quite high.
AP: 0.213585272
AP50: 0.332958937
AP75: 0.210112900
Training log file is attached.
log.txt (6.7 MB)
The total loss reached 0.916, which is quite good.
My training spec file is also attached.
maskrcnn_train_resnet50.txt (2.1 KB)
The command for visualizing the test images is the same one shared before:
!tao mask_rcnn inference -i $USER_EXPERIMENT_DIR/testimages \
-o $USER_EXPERIMENT_DIR/maskrcnn_annotated_images \
-e $SPECS_DIR/maskrcnn_train_resnet50.txt \
-m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-90000.tlt \
-l $SPECS_DIR/instance_label.txt \
-t 0.5 \
-k $KEY \
--include_mask
More test images are attached.
My objects include car, people, road, etc., as shown in instance_label.txt, but car and people are never detected.
instance_label.txt (173 Bytes)
What could be wrong?
Could you attach your /workspace/tao-experiments/data/mask_rcnn/instances_shape_validation2020.json ?
I'm afraid you need to set the correct num_classes: 125 during training.
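To double-check that value, a rough sketch that reads the COCO-style annotation file (an assumption here is that num_classes has to cover the highest category id plus the background index):

import json

# Count categories and find the highest category id in the COCO-style json.
with open("instances_shape_validation2020.json") as f:
    coco = json.load(f)

cat_ids = sorted(c["id"] for c in coco["categories"])
print("number of categories:", len(cat_ids))
print("highest category id:", cat_ids[-1])
print("suggested num_classes:", cat_ids[-1] + 1)  # +1 for background (assumption)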
But I am training with only 19 classes: 18 + 1 background.
My instances_shape_validation2020.json is attached.
instances_shape_validation2020.json (19.8 MB)
Let me retrain with 125 classes.
The original post here said 124 classes, and he also trained with 124 classes.
But in my Python code for the Mapillary-to-COCO conversion, I have only 18 classes, as shown in main.py.
main.py (9.6 KB)
I changed to 125 classes. The training AP50 is quite high and the loss is quite low during training.
AP: 0.194935068
AP50: 0.320598215
AP75: 0.191744179
The loss is 0.873.
But inference still gives the same result.
I don't understand: sky is not in my class list, yet sky is detected with high confidence.
The log file and spec file are attached.
log.txt (6.7 MB)
maskrcnn_train_resnet50.txt (2.1 KB)
The test images are still the same.
I don't know what mistake I have made.
I suspect I confused v1.2 and v2.0: I am using config_v1.2.json but the dataset is v2.0, so the colors and classes don't match. Let me check first. Thanks.
Also, from the log, the L2 loss does not decrease.
Please set a lower l2_weight_decay and retry:
l2_weight_decay: 0.00001
During training with the newly prepared dataset, I get the following error. How can I debug and fix it?
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15633}} Input to reshape is a tensor with 2960320 values, but the requested shape has 2691200
What did you change?
Can you try a new result folder?
I just use the new config_v2.0.json file and select the instance names to be used in training. The new dataset is prepared with 26+1 classes. That is all I did.
Yes, I use a new folder for every new training.
I always get this error when training reaches iteration 1885. How can I check which image is causing this issue? (A rough checking sketch follows the error snippet below.)
(0) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
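One rough way to look for the offending image is to compare each image's actual size with the size recorded in the COCO-style annotations, since a mismatch there is one possible cause of this kind of ground-truth mask reshape error. A minimal sketch (the annotation filename and image directory below are placeholders for this dataset):

import json
import os
from PIL import Image

# Flag images whose real dimensions differ from the dimensions recorded in the
# annotation file. "instances_shape_train2020.json" and "train_images" are
# placeholder names, not the exact paths used in this thread.
with open("instances_shape_train2020.json") as f:
    coco = json.load(f)

for entry in coco["images"]:
    path = os.path.join("train_images", entry["file_name"])
    with Image.open(path) as im:
        w, h = im.size
    if (w, h) != (entry["width"], entry["height"]):
        print(f"{entry['file_name']}: annotation {entry['width']}x{entry['height']}, file {w}x{h}")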
Now I have 23 classes using the v2.0 dataset. The spec file is attached.
maskrcnn_train_resnet50.txt (2.2 KB)
More info on the error follows:
[MaskRCNN] INFO : RPN total loss: 0.18269
DLL 2022-06-15 04:35:52.187695 - Iteration: 1885 RPN total loss : 0.18269
[MaskRCNN] INFO : Total loss: 2.25939
DLL 2022-06-15 04:35:52.188027 - Iteration: 1885 Total loss : 2.25939
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15633}} Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15633}} Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
0 successful operations.
0 derived errors ignored.
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Training Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222443 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222565 - : Training Performance Summary
DLL 2022-06-15 04:35:57.222612 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222660 - Average_throughput : 7.1 samples/sec
DLL 2022-06-15 04:35:57.222697 - Total processed steps : 1890
DLL 2022-06-15 04:35:57.222745 - Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO : Average throughput: 7.1 samples/sec
[MaskRCNN] INFO : Total processed steps: 1890
[MaskRCNN] INFO : Total processing time: 0h 00m 00s
DLL 2022-06-15 04:35:57.222996 - : ==================== Metrics ====================
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : FastRCNN box loss: 0.21846
DLL 2022-06-15 04:35:57.223386 - FastRCNN box loss : 0.21846
[MaskRCNN] INFO : FastRCNN class loss: 0.39607
DLL 2022-06-15 04:35:57.223522 - FastRCNN class loss : 0.39607
[MaskRCNN] INFO : FastRCNN total loss: 0.61454
DLL 2022-06-15 04:35:57.223641 - FastRCNN total loss : 0.61454
[MaskRCNN] INFO : L1 loss: 0.0000e+00
DLL 2022-06-15 04:35:57.223747 - L1 loss : 0.0000e+00
[MaskRCNN] INFO : L2 loss: 1.09823
DLL 2022-06-15 04:35:57.223883 - L2 loss : 1.09823
[MaskRCNN] INFO : Learning rate: 0.02
DLL 2022-06-15 04:35:57.224003 - Learning rate : 0.02
[MaskRCNN] INFO : Mask loss: 0.36394
DLL 2022-06-15 04:35:57.224126 - Mask loss : 0.36394
[MaskRCNN] INFO : RPN box loss: 0.06473
DLL 2022-06-15 04:35:57.224247 - RPN box loss : 0.06473
[MaskRCNN] INFO : RPN score loss: 0.11796
DLL 2022-06-15 04:35:57.224368 - RPN score loss : 0.11796
[MaskRCNN] INFO : RPN total loss: 0.18269
DLL 2022-06-15 04:35:57.224485 - RPN total loss : 0.18269
[MaskRCNN] INFO : Total loss: 2.25939
DLL 2022-06-15 04:35:57.224608 - Total loss : 2.25939
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`