OCDNet model keeps failing to train every time

shivashankarar · May 9, 2024, 7:39am

Please provide the following information when requesting support.

• Hardware (NVIDIA A10G)
• Network Type (Ocdnet VIT and resnet 50 )
• Docker being used : nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0
• Training spec file :
"""
load_pruned_graph: False
pruned_graph_path: '/results/prune/pruned_0.1.pth'
pretrained_model_path: '/data/ocdnet/ocdnet_fan_tiny_2x_icdar.pth'
backbone: fan_tiny_8_p4_hybrid
#backbone: deformable_resnet18
enlarge_feature_map_size: True
activation_checkpoint: True

train:
  results_dir: /home/ubuntu/OCD/OCD-data-results
  num_epochs: 80
  #resume_training_checkpoint_path: '/home/ubuntu/OCD/ocdnet_craft_result/ocd_model_epoch_009.pth'
  checkpoint_interval: 1
  validation_interval: 1
  is_dry_run: False
  precision: fp16
  model_ema: False
  model_ema_decay: 0.999
  trainer:
    clip_grad_norm: 5.0

  optimizer:
    type: Adam
    args:
      lr: 0.001

  lr_scheduler:
    type: WarmupPolyLR
    args:
      warmup_epoch: 3

  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5

  metric:
    type: QuadMetric
    args:
      is_output_polygon: False

dataset:
  train_dataset:
    data_path: ['/home/ubuntu/OCD/OCD-data/train-data']
    args:
      pre_processes:
        - type: IaaAugment
          args:
            - {'type':Fliplr, 'args':{'p':0.5}}
            - {'type': Affine, 'args':{'rotate':[-45,45]}}
            - {'type':Sometimes,'args':{'p':0.2, 'then_list':{'type': GaussianBlur, 'args':{'sigma':[1.5,2.5]}}}}
            - {'type':Resize,'args':{'size':[0.5,3]}}
        - type: EastRandomCropData
          args:
            size: [640,640]
            max_tries: 50
            keep_ratio: true
        - type: MakeBorderMap
          args:
            shrink_ratio: 0.4
            thresh_min: 0.3
            thresh_max: 0.7
        - type: MakeShrinkMap
          args:
            shrink_ratio: 0.4
            min_text_size: 8

      img_mode: BGR
      filter_keys: [img_path,img_name,text_polys,texts,ignore_tags,shape]
      ignore_tags: ['*', '###']
    loader:
      batch_size: 2
      pin_memory: true
      num_workers: 4

  validate_dataset:
    data_path: ['/home/ubuntu/OCD/OCD-data/test-data']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 1280
              - 736
            resize_text_polys: true
      img_mode: BGR
      filter_keys:
      ignore_tags: ['*', '###']
    loader:
      batch_size: 2
      pin_memory: false
      num_workers: 1
"""
ISSUE: I have trained the model more than 10 times on the same dataset with some variations, but it still fails to train. The loss never drops below about 0.968 and otherwise stays between roughly 0.9 and 1.0. Please help me with this.
The text is in dot-matrix format; here is part of an image.
images-all
Dataset Size is 5382 images including test and train

Training Terminal Snippet :

Morganh · May 9, 2024, 8:20am

Did you train completely?

How about the images’ resolution?

shivashankarar · May 9, 2024, 8:36am

Did you train completely?
Yes, I have trained it to completion many times, but the results are very poor.
How about the images’ resolution?
width : 945
height : 1587

Morganh · May 9, 2024, 8:41am

Can you run evaluation with similar resolution against original images?
Change above to

short_size:
- 960
- 1600

shivashankarar · May 9, 2024, 10:22am

Do you mean replacing 1280 with 1600 and 736 with 960 under short_size?

Morganh · May 9, 2024, 3:53pm

No, I mean width (960) x height (1600), since your images have a resolution of width (945) x height (1587).

960
1600

Besides evaluation, you can also run inference to check the result.
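
The sizes suggested here (960 x 1600 for 945 x 1587 images) are the image dimensions rounded up to the next multiple of 32. The sketch below shows that arithmetic. It assumes a multiple-of-32 constraint, which is consistent with the numbers in this thread but is not stated in it, and `round_up_to_multiple` is a hypothetical helper, not part of TAO.

```python
def round_up_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value up to the nearest multiple (32 is an assumed stride)."""
    return ((value + multiple - 1) // multiple) * multiple


# Image resolution reported in this thread.
width, height = 945, 1587

resized = (round_up_to_multiple(width), round_up_to_multiple(height))
print(resized)  # (960, 1600)
```

The same helper leaves 736, 1280, 800 and 1344 unchanged, since they are already multiples of 32.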

shivashankarar · May 10, 2024, 5:12am

My GPU runs out of memory when the validation step starts.

My GPU has 24 GB of memory.

Morganh · May 10, 2024, 5:40am

You can try width(800) x height(1344).

shivashankarar · May 10, 2024, 12:42pm

It still runs out of memory.

The images are in this format; ignore the image size, as it is resized in the code.

Morganh · May 10, 2024, 2:23pm

So, the test images are 1526x2048, right?

Is it running inference? Can you share the command and spec file?

shivashankarar · May 13, 2024, 10:32am

@Morganh Can I use a ResNet-50 OCDNet model together with the OCRNet ViT model in Triton for inference? I am having trouble running the OCRNet ViT model with the ResNet-50 OCDNet model.
Command used to convert the model to an engine:
/usr/src/tensorrt/bin/trtexec --onnx=./ocdnet.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:4x3x736x1280 --fp16 --saveEngine=./ocdnet.fp16.engine
Note: inference with the ResNet-50 model worked after I changed the width and height to 1280 and 736.

Morganh · May 13, 2024, 10:47am

Both ocr-resnet and ocr-vit should work, but their settings in the Triton spec JSON are not the same.

The Triton server did not start up. The error shows that something is mismatched in the engine generation or in the spec JSON settings.

shivashankarar · May 13, 2024, 11:23am

Both models have the same input size, right?
This is the spec file used:

{
    "is_high_resolution_input": false,
    "resize_keep_aspect_ratio": true,
    "overlapRate": 0.5,
    "input_data_format": "NHWC",
    "ocdnet_trt_engine_path": "/opt/nvocdr/engines/ocdnet_vit.fp16.engine",
    "ocdnet_infer_input_shape": [
        3,
        OCD_INPUT_H,
        OCD_INPUT_W
    ],
    "ocdnet_binarize_threshold": 0.3,
    "ocdnet_polygon_threshold": 0.1,
    "ocdnet_unclip_ratio": 1.5,
    "ocdnet_max_candidate": 1000,
    "upsidedown": true,
    "ocrnet_trt_engine_path": "/opt/nvocdr/engines/ocrnet_vit.fp16.engine",
    "ocrnet_dict_file": "/opt/nvocdr/onnx_model/character_list",
    "ocrnet_decode": "Attention",
    "ocrnet_infer_input_shape": [
        1,
        64,
        200
    ],
    "font_size": 0.6,
    "font_color": [0,0,255]
}
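
As posted, the spec still contains the literal placeholders `OCD_INPUT_H` and `OCD_INPUT_W`, so it is not valid JSON until they are replaced with integers. A minimal sketch of that substitution follows, assuming the placeholders are plain tokens to be replaced by the engine's height and width. `fill_spec` is a hypothetical helper, and the template is shortened to the two relevant keys.

```python
import json

# Shortened template; OCD_INPUT_H / OCD_INPUT_W are the placeholders from the spec.
TEMPLATE = """
{
    "ocdnet_trt_engine_path": "/opt/nvocdr/engines/ocdnet_vit.fp16.engine",
    "ocdnet_infer_input_shape": [3, OCD_INPUT_H, OCD_INPUT_W]
}
"""


def fill_spec(template: str, height: int, width: int) -> dict:
    """Substitute the placeholders and parse the result as JSON."""
    filled = (template
              .replace("OCD_INPUT_H", str(height))
              .replace("OCD_INPUT_W", str(width)))
    # json.loads raises json.JSONDecodeError if the text is still not valid JSON.
    return json.loads(filled)


spec = fill_spec(TEMPLATE, height=736, width=1280)
print(spec["ocdnet_infer_input_shape"])  # [3, 736, 1280]
```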

Morganh · May 13, 2024, 2:37pm

This needs to be set.

shivashankarar · May 13, 2024, 3:04pm

Yes, it is set to 736 and 1280, but the error still appears.

Morganh · May 13, 2024, 3:11pm

After rechecking the log:

The Triton server is up, but there is an error in the OCR engine. Can you double-check the OCR command used to generate the TensorRT engine?
Please double-check the steps described in the GitHub repository. We have confirmed that it works.
You can also try with the default models.

shivashankarar · May 14, 2024, 2:20am

OK, I checked it, but it is still failing. The TensorRT engine command for my trained ResNet-50 model was:

/usr/src/tensorrt/bin/trtexec --onnx=./ocdnet.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:4x3x736x1280 --fp16 --saveEngine=/opt/nvocdr/engines/ocdnet.fp16.engine

I also updated the engine paths in the spec.json file, but it still cannot run inference on an image of width 945 and height 736.
Note:
I am using a ResNet-50 OCDNet model for detection and a ViT OCRNet model for recognition. Both are trained on custom data, but when they run on Triton, triton_stub becomes unhealthy and the models are reloaded.

Morganh · May 14, 2024, 4:47am

What about the command for OCRNet? Did you follow the GitHub instructions to generate the OCRNet engine?
See the NVIDIA-AI-IOT/NVIDIA-Optical-Character-Detection-and-Recognition-Solution repository on GitHub.

For OCDNet, if you generate the TensorRT engine with the default command
/usr/src/tensorrt/bin/trtexec --onnx=./ocdnet.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:4x3x736x1280 --fp16 --saveEngine=/opt/nvocdr/engines/ocdnet.fp16.engine
then you need to set width 1280 and height 736 in the spec file. The width and height in the spec file must match the width and height in the trtexec command line.

For OCRNet, please check whether you are using ResNet-50 or ViT, then set the corresponding height and width in the OCRNet part of the spec file.
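
The matching rule above can be checked mechanically before starting the server. The sketch below pulls height and width out of a trtexec shape flag and compares them with `ocdnet_infer_input_shape`, assuming the spec stores the shape as [channels, height, width] as in the JSON posted earlier. `parse_trtexec_shape` and `shapes_match` are hypothetical helpers, not part of trtexec or the nvOCDR sample.

```python
import re


def parse_trtexec_shape(command: str, flag: str = "optShapes") -> tuple:
    """Return (height, width) from a flag like --optShapes=input:1x3x736x1280."""
    match = re.search(rf"--{flag}=\w+:(\d+)x(\d+)x(\d+)x(\d+)", command)
    if match is None:
        raise ValueError(f"--{flag} not found in command")
    _batch, _channels, height, width = (int(g) for g in match.groups())
    return height, width


def shapes_match(command: str, infer_input_shape: list) -> bool:
    """Compare the engine's H x W with the spec's [channels, height, width]."""
    _channels, spec_h, spec_w = infer_input_shape
    return parse_trtexec_shape(command) == (spec_h, spec_w)


cmd = ("/usr/src/tensorrt/bin/trtexec --onnx=./ocdnet.onnx "
       "--minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 "
       "--maxShapes=input:4x3x736x1280 --fp16")

print(shapes_match(cmd, [3, 736, 1280]))  # True: spec agrees with the engine
print(shapes_match(cmd, [3, 1600, 960]))  # False: spec and engine disagree
```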

system · June 1, 2024, 5:52am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
