Detectnet_v2 (ResNet-50) low accuracy on 2-class dataset

• Hardware (RTX 3080)
• Network Type (Detectnet_v2)
• TLT Version (docker_tag): v3.22.05-tf1.15.5-py3
• Training spec file:
detectnet_v2_train_resnet50_kitti.txt (4.2 KB)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Hi, I am training an object detection network on a two-class dataset. The classes are “healthy” and “damage”, and the images originate from a bespoke imaging system in which there is no rotational or colour variation, so those augmentation options have not been used. There is variation in image size, so “enable_auto_resize: True” has been used.
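For reference, the relevant part of my augmentation_config looks roughly like the sketch below; the values are illustrative rather than a verbatim copy of the attached spec:

    augmentation_config {
      preprocessing {
        output_image_width: 1248
        output_image_height: 384
        output_image_channel: 3
        enable_auto_resize: true
      }
      spatial_augmentation {
        hflip_probability: 0.5
        rotate_rad_max: 0.0      # no rotational variation in the imaging system
      }
      color_augmentation {
        hue_rotation_max: 0.0    # no colour variation
        saturation_shift_max: 0.0
        contrast_scale_max: 0.0
      }
    }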
When preparing the KITTI directories, the “healthy” class files were given six-digit names starting with a “0”, and the “damage” files start with a “1” as the first digit. The files within each class are not numbered entirely consecutively. I have read the “Data Annotation Format” page and, as far as I can tell, the data follows those guidelines.
The classes are also very easy to differentiate by human eye.
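In case it is relevant, the two classes are mapped one-to-one in dataset_config, along the lines of this sketch (the paths are placeholders and the image extension is an assumption):

    dataset_config {
      data_sources {
        tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"  # placeholder path
        image_directory_path: "/workspace/tao-experiments/data/training"              # placeholder path
      }
      image_extension: "png"  # assumption
      target_class_mapping {
        key: "healthy"
        value: "healthy"
      }
      target_class_mapping {
        key: "damage"
        value: "damage"
      }
      validation_fold: 0
    }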
The results of the evaluation are very bad:

[screenshot: evaluation results]
An accuracy of “0” for the “healthy” class suggests that something is set up incorrectly, even though everything runs as expected up to this point. As a sanity check I re-ran all the cells from the start (except the cells that fetch the KITTI object detection dataset and the ResNet-50 backbone) to ensure that the system had no files left over from earlier experiments.
Can you offer any advice please?

Can you share the training log? Is “healthy” getting AP 0 every time?
Also, what is the average resolution of the training images? For example, if it is 1024x768, you can set that in the config file. It is suggested to train a model whose input resolution is similar to that of the training images.
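The resolution is set under preprocessing in augmentation_config; a minimal sketch, assuming 1024x768 (note that DetectNet_v2 input dimensions must be multiples of 16):

    augmentation_config {
      preprocessing {
        output_image_width: 1024   # example value: set close to your dataset average
        output_image_height: 768
        enable_auto_resize: true
      }
    }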

training log 27-01-2023.docx (8.8 KB)

I have run training three times; each time “healthy” is 0. (“damage” was 2.02928. I moved “damage” ahead of “healthy” in the configuration file to see whether it was an alphabetical-ordering issue, and as a result “damage” changed to 0. I then read another post that mentioned “enable_auto_resize: True”; after trying that, the “damage” AP rose to 24.6958.)

Straight out of the capture device the resolution is 3648 x 1417, but there are also smaller cropped images in the datasets (in closely matching quantities across both classes), down to around 500 x 500. In the configuration file, under “augmentation_config”, I use 1248 x 384, which is close to an exact scale-down of the hi-res images and an approximate average, with both numbers divisible by 16.

Both classes have the same number of images.

I am wondering whether part of the problem is that an image with damage in it will often show the damage against a wider background that is otherwise healthy.

I am hoping not to have to use semantic segmentation for what appears to be a fairly straightforward task.

Is this training log correct? In the log, the training actually does not work.

This is the log that relates to the experiment. I think I misinterpreted the lines at:
2023-01-27 15:40:40,573 [INFO] root: saving trained model
&
2023-01-27 15:40:41,485 [INFO] root: Model saved

I then noted that two root errors had been reported, but I am unable to fully interpret the alerts. They seem to have been raised from the file “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”.
On my machine I do not have a directory called “tensorflow_core” in the dist-packages directory.

I ran the evaluate cell in the hope that it would help me analyse the issue.

May I request some guidance on this matter?

From the log:
“Restoring parameters from /tmp/tmp338kh8c4/model.ckpt-112201
INFO:tensorflow:Running local_init_op.”
Can you re-run after setting a new results folder in the command line?

May I make 100% sure I understand you: do you want me to recursively remove the folder named “experiment_dir_unpruned”, make a new empty directory named “experiment_dir_unpruned” for the training program to re-populate, then rerun as far as “!tao detectnet_v2 train” and check the training log?

Yes, that is right.

The old folder’s permissions were “peter:peter”; the new one’s are currently “root:root”. Should I change them?

Is this something that could have prevented the training from operating correctly, despite there being no permissions error?

Yes, you can.
I asked you to change the results folder because the log contains “Restoring parameters”, so I am not sure whether you were resuming a previous training. Changing the folder makes sure it is a new training.
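In notebook terms, the rerun might look like the sketch below; $LOCAL_EXPERIMENT_DIR, $USER_EXPERIMENT_DIR, $SPECS_DIR and $KEY are the usual notebook environment variables, so adjust them to your setup:

    # remove the old results so no checkpoint can be restored, then recreate the folder empty
    !rm -rf $LOCAL_EXPERIMENT_DIR/experiment_dir_unpruned
    !mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_unpruned
    # start a fresh training run that writes into the clean folder
    !tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet50_kitti.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY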

The training definitely ran this time and it was a new training:

[screenshot: training output]
“enable_auto_resize: True” was not changed.

However, the evaluation result is very similar to (actually worse than) before:

[screenshot: evaluation results]

This is still an extraordinarily poor result.
‘Healthy’ has always been zero.
Does this always mean that the network makes the wrong prediction every time, or could something else be set incorrectly?

Could you please offer some further guidance?

Could you share the latest training spec and upload the full training log? Thanks.

training_spec-07Feb2023.docx (7.0 KB)

training_log-7Feb2023.docx (217.7 KB)

Thank you

In the training spec, you have set the following:
output_image_width: 1248
output_image_height: 384

What is the average resolution of your training dataset?
I suggest setting the input resolution as close to it as possible.

Also, can you check the bbox heights and widths? Are they small? What are the average height and width?
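If it helps, a quick sketch for estimating this from the KITTI labels in a terminal (fields 5–8 of each label line are xmin, ymin, xmax, ymax; the path is a placeholder):

    awk '{ w[$1] += $7 - $5; h[$1] += $8 - $6; n[$1]++ }
         END { for (c in n) printf "%s: avg width %.1f, avg height %.1f\n", c, w[c]/n[c], h[c]/n[c] }' /path/to/labels/*.txt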

As I mentioned in post 3 above:
Straight out of the capture device the resolution is 3648 x 1417, but there are also smaller cropped images in the datasets (in closely matching quantities across both classes), down to around 500 x 500 or even less. In the configuration file, under “augmentation_config”, I use 1248 x 384, which is close to an exact scale-down of the hi-res images and an approximate average, with both numbers divisible by 16.

The size of a “healthy” bbox always matches the image size, so the average of height and width will vary from 1845.5 for the largest images down to 350 or less for the smallest images.

A “damage” bbox will have an average in the range 800 down to around 200. Occasionally a “damage” image might fill a small frame.

I am struggling to understand how the average precision of the “healthy” class can be zero. In every healthy image the bbox should exactly match the image dimensions. To achieve an accuracy of zero, the network would need to be making a prediction outside the image, which should not be possible. An average precision of 100% would seem much more likely.

Does the notebook allow the confidence threshold to be adjusted? Maybe it is a confidence issue?

So, please try setting 912 x 352 in the training spec.

And set lower minimum_height and minimum_width, for example, 16.
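In augmentation_config that would look roughly like this; only the resolution lines change:

    augmentation_config {
      preprocessing {
        output_image_width: 912    # both values are divisible by 16
        output_image_height: 352
        enable_auto_resize: true
      }
      # ...rest of augmentation_config unchanged
    }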

Refer to Frequently Asked Questions - NVIDIA Docs.
The following parameters can help you improve AP on small objects (a rough spec sketch follows the list):

  • Increase num_layers of the resnet backbone
  • Increase class_weight for the small-object class
  • Increase the coverage_radius_x and coverage_radius_y parameters in the bbox_rasterizer_config section for the small-object class
  • Decrease minimum_detection_ground_truth_overlap
  • Lower minimum_height to cover more small objects in evaluation.
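As a rough illustration of where these parameters live in the spec (the class names are from your dataset; the values are placeholders to tune, not recommendations):

    cost_function_config {
      target_classes {
        name: "damage"
        class_weight: 4.0          # illustrative: weight the small-object class up
        coverage_foreground_weight: 0.05
      }
      # ...the "healthy" target_classes block stays as it is
    }

    bbox_rasterizer_config {
      target_class_config {
        key: "damage"
        value {
          cov_center_x: 0.5
          cov_center_y: 0.5
          cov_radius_x: 1.0        # illustrative: increased for the small-object class
          cov_radius_y: 1.0
          bbox_min_radius: 1.0
        }
      }
      deadzone_radius: 0.4
    }

    evaluation_config {
      minimum_detection_ground_truth_overlap {
        key: "damage"
        value: 0.5                 # illustrative: lowered to match more small boxes
      }
    }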

Does this refer to “min_bbox_width” and “min_bbox_height” in the augmentation_config section? (Values for these are also present in several other sections of the configuration file.)

They are in the evaluation section.
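That is, under evaluation_box_config inside evaluation_config; a sketch:

    evaluation_config {
      evaluation_box_config {
        key: "healthy"
        value {
          minimum_height: 16
          maximum_height: 9999
          minimum_width: 16
          maximum_width: 9999
        }
      }
      # repeat the block with key: "damage"
    }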

I followed your advice. I cannot increase the number of layers, as I am already using ResNet-50, and cov_radius_x and cov_radius_y are already set to 1.0 for “damage”, which is the maximum value allowed.

This is the result:

[screenshot: evaluation results]
A result of 57.9215 is a big step in the right direction and I will continue to tune the parameters.

Does the ubiquitous result of “0” mean something is not set up correctly? Zero is difficult to achieve, isn’t it?

Also, as asked before, how can the network predict a position or size outside the image?

Should I be looking for the solution in the configuration file or elsewhere?

May I also ask what exactly the following line means, in particular what the figures relate to?
“Matching predictions to ground truth, class 1/2.: 100%|█| 53/53 [00:00<00:00, 19234.93it/s]”