Image size - DetectNet_v2


With reference to the DetectNet_v2 documentation (Transfer Learning Toolkit 3.0) and the sample Jupyter notebook/dataset:

Please could I get some guidance on how to train a model using my own data?
The image resolution and aspect ratio in the sample is an unusual 1248x384.

My data:

Native image size = 1920x1080 (16:9 aspect ratio)
Annotations converted to KITTI format, which uses absolute pixel coordinates rather than relative or percentage values, so I'm assuming that if I resize the images, I'd need to regenerate the KITTI annotations?
TFRecords generated from the above
I don't necessarily need to run inference at this resolution, but certainly at a 16:9 aspect ratio
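
To illustrate the annotation point above, here is a minimal sketch of rescaling one KITTI label line when an image is resized. It assumes the standard 15-field KITTI layout, where fields 4 to 7 hold xmin, ymin, xmax, ymax in absolute pixels; the function name `scale_kitti_line` and the sample label are hypothetical, not part of the toolkit.

```python
def scale_kitti_line(line: str, sx: float, sy: float) -> str:
    """Scale the 2D bounding box of one KITTI label line by (sx, sy).

    Assumption: fields 4..7 are xmin, ymin, xmax, ymax in absolute pixels.
    All other fields are passed through unchanged.
    """
    fields = line.split()
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{xmin * sx:.2f}",
        f"{ymin * sy:.2f}",
        f"{xmax * sx:.2f}",
        f"{ymax * sy:.2f}",
    ]
    return " ".join(fields)


# Resizing 1920x1080 to 1280x720 scales both axes by 2/3.
sx, sy = 1280 / 1920, 720 / 1080

# Hypothetical label for illustration only.
line = "car 0.00 0 0.00 300.00 150.00 600.00 450.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
scaled = scale_kitti_line(line, sx, sy)
print(scaled)
```

Because both axes scale by the same factor here, the aspect ratio of every box is preserved.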

I get that the width and height need to be multiples of 16,
i.e. a width of 1920 is OK, but a height of 1080 is not (1080/16 = 67.5).
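
The arithmetic can be checked with a small sketch. The helper name `round_up_to_multiple` is hypothetical; it only computes the next size that satisfies the multiple-of-16 rule.

```python
def round_up_to_multiple(value: int, base: int = 16) -> int:
    """Return the smallest multiple of `base` that is >= value."""
    return ((value + base - 1) // base) * base


width, height = 1920, 1080
padded_w = round_up_to_multiple(width)
padded_h = round_up_to_multiple(height)

print(padded_w, padded_h)   # 1920 1088
print(padded_h - height)    # 8 extra rows are needed
```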

Does this mean that:

I just change the training augmentation_config to the next multiple, i.e. output_image_width = 1920, output_image_height = 1088? It looks like the images get padded.
I need to resize the images and annotations to multiples of 16, i.e. 1280x720, which keeps the correct aspect ratio and also meets the multiple-of-16 rule?
Or some other option?

I think other people would need this as well, so a sample dataset or training config would be fantastic.

For now, my training config uses the next size up that is divisible by 16, and training seems to be running OK, but I'm not sure whether this is the best approach, or whether it means I have to run inference at this resolution.

thank you


Yes. Set output_image_width to 1920 and output_image_height to 1088.
More info can be found in DetectNet_v2 — TAO Toolkit 4.0 documentation and DetectNet_v2 — TAO Toolkit 4.0 documentation.

The train tool does not support training on images of multiple resolutions. However, the dataloader does support resizing images to the input resolution defined in the specification file. This can be enabled by setting the enable_auto_resize parameter to true in the augmentation_config module of the spec file.

If the output image height and output image width of the preprocessing block do not match the dimensions of the input image, the dataloader either pads with zeros or crops to fit the output resolution. It does not resize the input images and labels to fit.
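
The pad-or-crop behaviour described above can be pictured with a small pure-Python sketch. This is not the toolkit's dataloader code: the function name `pad_or_crop` is hypothetical, and anchoring at the top-left corner (padding added at the bottom and right, cropping from the bottom and right) is an assumption made for illustration.

```python
def pad_or_crop(image, out_h, out_w, fill=0):
    """Fit `image` (a list of rows of pixel values) to out_h x out_w.

    Assumption: the image is anchored at the top-left corner, so extra
    rows/columns are zero-filled at the bottom/right and excess
    rows/columns are cut from the bottom/right. No resizing happens.
    """
    # Crop or pad each kept row to the target width.
    rows = [
        row[:out_w] + [fill] * max(0, out_w - len(row))
        for row in image[:out_h]
    ]
    # Add zero-filled rows if the image is shorter than the target height.
    rows += [[fill] * out_w for _ in range(out_h - len(rows))]
    return rows


# A 2x3 "image" fitted to 3x2: one column is cropped, one row is padded.
small = [[1, 2, 3],
         [4, 5, 6]]
out = pad_or_crop(small, out_h=3, out_w=2)
print(out)  # [[1, 2], [4, 5], [0, 0]]
```

Under the top-left assumption, existing pixel coordinates keep their positions when padding is added.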

thanks @Morganh

This obviously means that I need to run inference at this resolution, although in this case 8 pixels probably does not matter.


Yes, W and H must be multiples of 16.
