ResNet18 Object Detection Image Resolution Problem

I first tried training a ResNet18 model on the publicly available KITTI dataset to get an idea of how the whole TLT workflow works. I had to write small scripts to resize all the images to the same resolution and rescale their bounding boxes accordingly; otherwise training threw errors after a few epochs.
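For anyone following along, here is a minimal sketch of the label half of such a script. It assumes standard KITTI label lines, where fields 4 to 7 hold the 2D box as left, top, right, bottom in pixels. The function name `rescale_kitti_label` and the example sizes are my own illustration, not part of TLT; the image itself would be resized separately with an image library.

```python
def rescale_kitti_label(line: str, scale_x: float, scale_y: float) -> str:
    """Scale the 2D box of one KITTI label line (hypothetical helper).

    KITTI fields 4..7 are left, top, right, bottom in pixels; all other
    fields are passed through unchanged.
    """
    fields = line.split()
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{left * scale_x:.2f}",
        f"{top * scale_y:.2f}",
        f"{right * scale_x:.2f}",
        f"{bottom * scale_y:.2f}",
    ]
    return " ".join(fields)


# Example sizes are assumptions: an image resized from 1242x375 to 1248x384.
src_w, src_h = 1242, 375
dst_w, dst_h = 1248, 384
label = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
print(rescale_kitti_label(label, dst_w / src_w, dst_h / src_h))
```

The x and y scale factors are kept separate because a resize to a fixed resolution usually changes width and height by different ratios.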
I successfully trained the model on a DGX-1 and moved on to deploying it to an AGX Xavier.
Training went smoothly because all the images have similar resolutions, so it was easy to resize them to a uniform resolution whose width and height are multiples of 16.
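A small sketch of how a target resolution that is a multiple of 16 can be picked. Rounding up rather than to the nearest multiple is my own choice here, and `round_up_to_multiple` is a hypothetical helper name.

```python
import math


def round_up_to_multiple(value: int, base: int = 16) -> int:
    """Return the smallest multiple of `base` that is >= value."""
    return math.ceil(value / base) * base


# A typical KITTI frame of 1242x375 maps to 1248x384.
print(round_up_to_multiple(1242), round_up_to_multiple(375))
```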

Now comes the real task: I have to train a model on a custom dataset (all the required labels are already prepared in KITTI format).
I also now have good experience building the spec files needed to train a model successfully.

But here’s the issue: the image resolutions differ widely, and so do their aspect ratios.
I can’t simply resize all the images to the same resolution, as that would distort or destroy the sensitive parts of the images.

I am sharing the statistics I have computed on their resolutions so far:

mean     89.12        1.205335        323.0      224.0
std      85.11        0.604291        187.0      226.0
min      3.80         0.000000        49.0       27.0
25%      28.42        1.000000        175.0      99.0
50%      61.94        1.000000        272.0      154.0
75%      118.75       2.000000        427.0      247.0
max      823.51       6.000000        1122.0     1920.0

Histogram: height vs. width

Can anyone please suggest a good way to preprocess these images for training, so that no sensitive information is destroyed?

The training image size should be close to the input image size at inference. It is better to make training as close to actual use as possible.
You could also consider padding.
For example, with 480x480 and 640x640 images, pad the 480x480 images with zeros to make them 640x640. Make sure to keep the aspect ratio when padding.
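To make the padding suggestion concrete, here is a rough sketch of the geometry only, under my own assumptions: images smaller than the target are not upscaled, larger ones are shrunk with a single scale factor so the aspect ratio is kept, and the zero padding is split evenly around the image. The names `letterbox_params` and `letterbox_box` are hypothetical. The actual pixel padding would be done with an image library, and the same scale and offsets must be applied to the KITTI boxes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom


def letterbox_params(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[float, int, int]:
    """Return (scale, pad_left, pad_top) for fitting src into dst.

    One scale factor is used for both axes, so the aspect ratio is kept.
    The scale is capped at 1.0, so small images are only padded.
    """
    scale = min(dst_w / src_w, dst_h / src_h, 1.0)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    # Split the leftover space evenly; the remaining area is zero padding.
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return scale, pad_left, pad_top


def letterbox_box(box: Box, scale: float, pad_left: int, pad_top: int) -> Box:
    """Map a box from source image coordinates to padded image coordinates."""
    left, top, right, bottom = box
    return (
        left * scale + pad_left,
        top * scale + pad_top,
        right * scale + pad_left,
        bottom * scale + pad_top,
    )


# A 480x480 image placed in a 640x640 canvas: no scaling, 80 px of padding per side.
scale, pad_left, pad_top = letterbox_params(480, 480, 640, 640)
print(scale, pad_left, pad_top)
print(letterbox_box((100, 120, 200, 260), scale, pad_left, pad_top))
```

Padding only on the right and bottom would work as well and leaves the box offsets at zero; centering is just one option.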

I’m also interested in what image size to use for training. I hope you can read this and give me some advice:
According to the Python-apps test3 sample, input from an RTSP camera is set to 1920x1080. To train a model and run it on this input, should I train on images that big? Wouldn’t that be too much to process?
Or could I train on smaller dimensions with the same height-to-width ratio? Or can the input stream be resized before it is fed to the model?

No. In TLT you can train a model at different dimensions, then run inference in DeepStream against an mp4 of any size.


Now, for training, ALL images must have the same dimensions, even if, as you say, they differ from those of the input stream. Am I right?

For training images, please see the requirements of each network. For example, ssd can accept training images of any resolution, but detectnet_v2 and frcnn cannot.
For inference, all the networks accept images with any resolution.