Let me rephrase my problem.
Hardware spec:
- CPU: Intel Core i9
- GPU: NVIDIA RTX 2080 Ti

Note: TLT and DeepStream are installed inside Docker on this PC.
I trained the default TLT MaskRCNN model using the notebook provided with TLT. That model exported and worked as expected.
Now I am trying to build a custom instance segmentation model with my own dataset.
To start, I chose a minimal number of images (JPG) for both training and testing.
I annotated them in Intel CVAT under a single class (plant), then used "Export as dataset: TFRecord" (for the .tfrecord files) and "Export annotations as COCO" (for the .json file).
I then updated the spec file so the tfrecord and json paths point to my files.
(Note: only the file locations were changed; all other parameters were left untouched.)
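As a side note, before training I ran a quick consistency check on the exported COCO file to rule out annotation problems. This is just my own sketch (the sample dict and field checks are mine, not CVAT output from my actual dataset); for a real export you would load the .json with `json.load` first:

```python
def check_coco(coco):
    """Basic consistency checks on a COCO-format annotation dict."""
    image_ids = {img["id"] for img in coco["images"]}
    cat_ids = {c["id"] for c in coco["categories"]}
    for ann in coco["annotations"]:
        # Every annotation must reference a known image and category,
        # and must carry a segmentation (MaskRCNN needs masks, not just boxes).
        assert ann["image_id"] in image_ids, f"orphan annotation {ann['id']}"
        assert ann["category_id"] in cat_ids, f"bad category in annotation {ann['id']}"
        assert ann.get("segmentation"), f"annotation {ann['id']} has no mask"
    return (len(coco["images"]), len(coco["annotations"]),
            [c["name"] for c in coco["categories"]])

# Minimal inline example instead of a real file path
# (real usage: import json; coco = json.load(open("annotations.json"))):
sample = {
    "images": [{"id": 1, "file_name": "plant_001.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 1,
                     "segmentation": [[10, 10, 100, 10, 100, 100]],
                     "area": 4050.0, "bbox": [10, 10, 90, 90], "iscrowd": 0}],
    "categories": [{"id": 1, "name": "plant"}],
}
print(check_coco(sample))  # → (1, 1, ['plant'])
```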
```
seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.02
```
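For reference, my understanding of how these values combine is a linear warmup from `warmup_learning_rate` to `init_learning_rate` over `warmup_steps`, then step decay at each entry of `learning_rate_steps` to `init_learning_rate * decay_level`. The function below is only my own sketch of that schedule, not TLT source code:

```python
def effective_lr(step, init_lr=0.02, warmup_lr=0.0001, warmup_steps=1000,
                 lr_steps=(10000, 15000, 20000), decays=(0.1, 0.02, 0.01)):
    """Approximate learning rate at a given step (assumed schedule, not TLT code)."""
    if step < warmup_steps:
        # Linear ramp from warmup_lr up to init_lr during warmup.
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    lr = init_lr
    for boundary, decay in zip(lr_steps, decays):
        if step >= boundary:
            # Past this boundary, lr is init_lr scaled by the decay level.
            lr = init_lr * decay
    return lr

print(effective_lr(0))      # → 0.0001 (start of warmup)
print(effective_lr(5000))   # → 0.02   (after warmup, before first decay)
print(effective_lr(24000))  # 0.02 * 0.01 after the last boundary
```

With `num_steps_per_eval: 5000` and `total_steps: 25000`, evaluation runs at steps 5000, 10000, 15000, 20000, and 25000; the first failure I see corresponds to the first of these evaluations.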
After completing this configuration I started training, which did not throw any errors at the start.
When evaluation started, it threw the following error:
```
[MaskRCNN] ERROR : Job finished with an uncaught exception: FAILURE
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
```
I can resume training past the evaluation step by re-running the same training cell after the error, but I cannot find the cause of the evaluation failure.
I also cannot run inference without errors.
I kept the .etlt model file that was generated during the process.
After the model was created, I used the DeepStream installation inside Docker and built the OSS plugins inside the container.
Following the deployment steps from this blog post (https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/), I deployed the model with deepstream-app using a video input.
The engine was created and the model loaded, but the output window showed no segmentations.
Summary of problems:
- Training a custom MaskRCNN model in TLT fails during evaluation and inference.
- The model converts to an engine without errors, but no segmentation appears in the output.
I have been trying to solve these problems for the past two weeks and cannot find the cause.
I have attached the log files and dataset files used; please try to reproduce the problem I am facing.
Log Files:
https://drive.google.com/drive/folders/1ZlRnhJIeCnOuMuZUvHjZ6FrtkWTeu6aD?usp=sharing