Detectnet_v2(resnet50) low accuracy on 2 class dataset

• Hardware (RTX 3080)
• Network Type (Detectnet_v2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file:
detectnet_v2_train_resnet50_kitti.txt (4.2 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi, I am training an object detection network on a two-class dataset. The classes are ‘healthy’ and ‘damage’, and the images originate from a bespoke imaging system with no rotational or colour variation, so those augmentation options have not been used. There are variations in image size, so ‘enable_auto_resize: True’ has been used.
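For reference, the relevant part of my spec looks roughly like this (a sketch only; field names follow the standard DetectNet_v2 ‘augmentation_config’ layout, and the values are illustrative):

```
augmentation_config {
  preprocessing {
    output_image_width: 1248
    output_image_height: 384
    enable_auto_resize: true
  }
  spatial_augmentation {
    rotate_rad_max: 0.0     # no rotational variation in the capture rig
  }
  color_augmentation {
    hue_rotation_max: 0.0   # no colour variation either
  }
}
```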
When preparing the KITTI directories, the ‘healthy’ class files were given six-digit names starting with a ‘0’, and the ‘damage’ files start with a ‘1’. The numbering within each class is not entirely consecutive. I have read the ‘Data Annotation Format’ page and, as far as I can tell, the data follows those guidelines.
Also, the classes are very easy to differentiate by eye.
The results of the evaluation are very bad:

An accuracy of ‘0’ for the ‘healthy’ class suggests that something is set up incorrectly, even though everything runs as expected up to that point. As a sanity check I re-ran all the cells from the start (except for downloading the ‘kitti object detection’ dataset and the ResNet-50 backbone) to ensure that the system contained no files from earlier experiments.
Can you offer any advice please?

Can you share the training log? Is the healthy class getting AP 0 every time?
Also, what is the average resolution of the training images? For example, if it is 1024x768, you can set that in the config file. It is suggested to train a model whose input resolution is similar to that of the training images.
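For example, something along these lines in the spec (a sketch assuming the standard DetectNet_v2 ‘augmentation_config’ layout; 1024x768 is only the example resolution mentioned above):

```
augmentation_config {
  preprocessing {
    # match the model input to the typical training-image resolution;
    # both dimensions must be multiples of 16
    output_image_width: 1024
    output_image_height: 768
  }
}
```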

training log 27-01-2023.docx (8.8 KB)

I have run training three times; each time ‘healthy’ is 0. (‘damage’ was 2.02928. I moved ‘damage’ ahead of ‘healthy’ in the configuration file to see if it was an alphabetical issue; as a result ‘damage’ changed to 0. I then read another post that mentioned ‘enable_auto_resize: True’; I tried that and the ‘damage’ AP rose to 24.6958.)

Straight out of the capture device the resolution is 3648 x 1417, but the datasets also contain smaller cropped images (in closely matching quantities across both classes), down to around 500 x 500. In the configuration file, under ‘augmentation_config’, I use 1248 x 384, which is close to an exact scale-down of the high-resolution images and an approximate average, with both numbers divisible by 16.

Both classes have the same number of images.

I am wondering whether part of the problem is that an image containing damage will often show it against a wider background that is otherwise healthy.

I am hoping not to have to use semantic segmentation for what appears to be a fairly straightforward task.

Is this training log correct? In the log, the training does not actually work.

This is the log that relates to the experiment. I think I misinterpreted the lines at:
2023-01-27 15:40:40,573 [INFO] root: saving trained model
2023-01-27 15:40:41,485 [INFO] root: Model saved

I then noted that two root errors had been reported, but I am unable to fully interpret the alerts. However, they seem to have been raised from the file “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/”
On my machine I do not have a directory called “tensorflow_core” in the dist-packages directory.

I ran the evaluate cell in the hope that it would help me analyse the issue.

May I request some guidance on this matter?

From the log:
“Restoring parameters from /tmp/tmp338kh8c4/model.ckpt-112201
INFO:tensorflow:Running local_init_op.”
Can you re-run after changing to a new result folder in the command line?

May I make 100% sure I understand you: do you want me to recursively remove the folder named “experiment_dir_unpruned”, create an empty new directory with the same name for the training program to re-populate, rerun as far as “!tao detectnet_v2 train”, and then check the training log?
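In other words (a sketch of the steps, assuming the default notebook folder name “experiment_dir_unpruned”; the spec file and key below are placeholders):

```shell
# Remove the old result folder so no earlier checkpoint can be restored,
# then recreate it empty for the new training run.
RESULT_DIR="experiment_dir_unpruned"

rm -rf "$RESULT_DIR"
mkdir -p "$RESULT_DIR"

# Then rerun the training cell, e.g. (in the notebook):
# !tao detectnet_v2 train -e <spec_file> -r $RESULT_DIR -k <key>
```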

Yes, that is right.

The old folder permissions were “peter:peter”, the new ones are currently “root:root”. Should I change them?

Could this have prevented the training from operating correctly, despite there being no permissions error?

Yes, you can.
I am asking you to change the result folder because there is “Restoring parameters” in the log. I am not sure whether you are resuming training, so please just make sure it is a new training.

The training definitely ran this time and it was a new training:

“enable_auto_resize: True” was not changed

However the evaluation result is very similar to (actually worse than) before:

This is still an extraordinarily poor result.
‘Healthy’ has always been zero.
Does that always mean the network makes the wrong prediction every time, or could I have something else set incorrectly?

Could you please offer some further guidance.

Could you share the latest training spec and upload the full training log? Thanks.

training_spec-07Feb2023.docx (7.0 KB)


training_log-7Feb2023.docx (217.7 KB)

Thank you

In the training spec, you have set the following:
output_image_width: 1248
output_image_height: 384

What is the average resolution of your training dataset?
I suggest setting these values as close to it as possible.

Also, can you check the bbox heights and widths? Are they small? What are the average height and width?

As I mentioned in post 3 above:
Straight out of the capture device the resolution is 3648 x 1417, but the datasets also contain smaller cropped images (in closely matching quantities across both classes), down to around 500 x 500 or even less. In the configuration file, under ‘augmentation_config’, I use 1248 x 384, which is close to an exact scale-down of the high-resolution images and an approximate average, with both numbers divisible by 16.


The size of a “healthy” bbox always matches the image size, so the average of height and width will vary from 1845.5 for the largest images down to 350 or less for the smallest.

A “damage” bbox will have an average in the range 800 down to around 200. Occasionally a “damage” image might fill a small frame.

I am struggling to understand how the average precision of the “healthy” class can be zero. In every healthy image the bbox should exactly match the image dimensions. To achieve an accuracy of zero the network would need to be making predictions outside the image, which should not be possible. An average precision of 100% would seem much more likely.

Does the notebook allow the confidence threshold to be adjusted? Maybe it is a confidence issue?

So, please try setting 912 x 352 in the training spec.

And set lower minimum_height and minimum_width, for example, 16.

Refer to Frequently Asked Questions - NVIDIA Docs
The following parameters can help you improve AP on smaller objects:

  • Increase num_layers of resnet
  • class_weight for small objects
  • Increase the coverage_radius_x and coverage_radius_y parameters of the bbox_rasterizer_config section for the small objects class
  • Decrease minimum_detection_ground_truth_overlap
  • Lower minimum_height to cover more small objects for evaluation.
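Putting the suggestions above together, the relevant spec fragments would look roughly like this (illustrative values only; the per-class key follows your dataset, and field names are from the standard DetectNet_v2 spec):

```
augmentation_config {
  preprocessing {
    output_image_width: 912    # 912 and 352 are both multiples of 16
    output_image_height: 352
  }
}
evaluation_config {
  minimum_detection_ground_truth_overlap {
    key: "healthy"
    value: 0.5                 # try decreasing this for small objects
  }
  evaluation_box_config {
    key: "healthy"
    value {
      minimum_height: 16       # lower so small boxes still count in eval
      minimum_width: 16
    }
  }
}
```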

Does this refer to “min_bbox_width:” and “min_bbox_height:” in the augmentation_config section? (Values for these are also present in several other sections of the configuration file.)

They are in the evaluation section.

I followed your advice. I cannot increase the layers as I am already using ResNet-50, and coverage_radius_x and coverage_radius_y are already set to 1.0 for “damage”, which is the maximum value allowed.

This is the result:

A result of 57.9215 is a big step in the right direction and I will continue to tune the parameters.

Does the ubiquitous result of “0” mean something is not set up correctly? Zero is difficult to achieve, isn’t it?

Also, as asked before, how can the network predict a position or size outside the image?

Should I be looking for the solution in the configuration file or elsewhere?

May I also ask what exactly the following means:
“Matching predictions to ground truth, class 1/2.: 100%|█| 53/53 [00:00<00:00, 19234.93it/s]”
In particular, what do the figures relate to?