TLT detectnet_v2 set training width and height

I have trained a detectnet_v2 model on a KITTI-formatted dataset. It shows an average precision of 58% during evaluation in TLT, but when I run it on video it does not detect a single object properly. Here is the training config file: resnet_train.txt (3.0 KB)
My input images are 1280x720, so where should I set the width and height parameters in the training config file?

When you say "running in video", do you mean you are using DeepStream to run inference? If so, please share the DeepStream config file.

Yes, deepstream-app was used for inference. Here is the pgie config file:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
labelfile-path=detectnet_v2_labels.txt
tlt-encoded-model=resnet18_detector.etlt
tlt-model-key=ZHBmNTA4cHRkNDZwM2****************S00NTZjLTlhOWYtMzI3N2U0ODBiMWU1
#infer-dims=3;544;960
infer-dims=3;720;1280
uff-input-order=0
uff-input-blob-name=input_1
output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
is-classifier=0

[class-attrs-all]
pre-cluster-threshold=0.05
group-threshold=1
eps=0.2
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
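
A quick way to confirm that the pgie config agrees with the training resolution is to parse it and compare. The sketch below is only an illustration: it assumes `infer-dims` is ordered channel;height;width, and that DetectNet_v2 expects width and height to be multiples of 16, as I understand the TLT documentation. The helper names and the trimmed-down config string are hypothetical.

```python
import configparser

# Trimmed copy of the [property] section posted above (only the keys used here).
PGIE_TEXT = """
[property]
net-scale-factor=0.0039215697906911373
infer-dims=3;720;1280
"""


def read_infer_dims(config_text):
    """Return (channels, height, width), assuming channel;height;width order."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser["property"]["infer-dims"]
    channels, height, width = (int(value) for value in raw.split(";"))
    return channels, height, width


def check_against_training(config_text, train_width, train_height):
    """List mismatches between the pgie config and the training resolution."""
    problems = []
    _, height, width = read_infer_dims(config_text)
    if (width, height) != (train_width, train_height):
        problems.append(
            f"infer-dims is {width}x{height}, training used {train_width}x{train_height}"
        )
    # Assumption: DetectNet_v2 input dimensions should be multiples of 16.
    for name, value in (("width", width), ("height", height)):
        if value % 16:
            problems.append(f"{name} {value} is not a multiple of 16")
    return problems


problems = check_against_training(PGIE_TEXT, train_width=1280, train_height=720)
print(problems if problems else "infer-dims matches the training resolution")
```

For the config above, 1280 and 720 are both multiples of 16 and match the stated image size, so this particular check does not point at the DeepStream side.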

How does the tlt-infer result look? Does it detect well?

Also, please share a screenshot of the output when you run inference with DeepStream.

Here is the result of running the tlt-infer command, even though the evaluation shows 58% average precision. The model is trained to detect vehicle license plates. Note that the green bboxes in the training images were dumped by running fd_lpd.caffemodel in deepstream-app, and the red bboxes are the inference results.

Here’s a sample training image and corresponding label


Labelfile.txt

license_plate 0 0 0 0 508.133331 58.074078 531.066650 71.555557 0 0 0 0 0 0

According to your figure 1, the tlt-infer result is also not good.
So, could you please run tlt-infer on more images and then calculate the average precision?
If it is close to 58%, I am afraid you will need to run more experiments to improve the training mAP.
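
When scoring tlt-infer output by hand, the basic building block is the intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below is a minimal illustration, not the TLT evaluation code: the 0.5 match threshold is an assumption, and the prediction is a made-up box with its left edge at x = 0, not real tlt-infer output.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    intersection = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Ground truth: the four coordinates from the sample label in this thread.
ground_truth = (508.133331, 58.074078, 531.066650, 71.555557)
# Hypothetical prediction with the same right, top and bottom edges
# but a left edge stuck at 0.
prediction = (0.0, 58.074078, 531.066650, 71.555557)

overlap = iou(ground_truth, prediction)
print(f"IoU = {overlap:.3f}, match at 0.5: {overlap >= 0.5}")
```

A box stretched to the left edge like this overlaps the true plate only slightly, so it would count as a miss at any common threshold.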

In all of the tlt-infer results, the left coordinate of the bbox is at 0 in the frame. Here are some inference images in the zip file:
tlt-infer.zip (2.0 MB)

So my question is: is there any need to define the image width and height in the training spec file if the training image size differs from the standard KITTI size (i.e. 1248x384)?

Hey, please note that if you want to train a 1248x384 detectnet_v2 model, you need to resize all the images and labels to 1248x384 offline. If you want to train a 1280x720 detectnet_v2 model, you need to resize all the images and labels to 1280x720 offline.

See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/supported_model_architectures.html#detectnet-v2
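
Resizing offline means the bbox columns in each label file have to be scaled by the same factors as the image. Here is a minimal sketch of the label side only, assuming the standard 15-field KITTI line with the bbox in columns 5 to 8. The function name, the example line, and the 1920x1080 source size are hypothetical; the images themselves would need to be resized separately with an imaging library. As far as I recall, the training resolution itself is declared in the spec via `output_image_width` and `output_image_height` under `augmentation_config.preprocessing`, but please verify that against the documentation linked above.

```python
def scale_kitti_line(line, src_size, dst_size):
    """Scale the bbox of one KITTI label line from src (w, h) to dst (w, h)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    # Columns 5-8 (indices 4-7) hold xmin, ymin, xmax, ymax.
    xmin, ymin, xmax, ymax = (float(value) for value in fields[4:8])
    fields[4:8] = [
        f"{xmin * scale_x:.2f}",
        f"{ymin * scale_y:.2f}",
        f"{xmax * scale_x:.2f}",
        f"{ymax * scale_y:.2f}",
    ]
    return " ".join(fields)


# Hypothetical label for a 1920x1080 image, rescaled to 1280x720.
original = "car 0.00 0 0.00 300.00 150.00 600.00 450.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
resized = scale_kitti_line(original, src_size=(1920, 1080), dst_size=(1280, 720))
print(resized)
```

If the images are already 1280x720, as stated later in the thread, no rescaling is needed and the labels can stay as they are.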

All of my training images are 1280x720 in size. In the label file, all the 0 fields are written as integers, but according to the documentation some of them should be floats. Could that be a problem?

I suggest following the format described in https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/preparing_data_input.html#label-files
The bbox coordinate values are also floats.
For example,
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
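
One way to spot a shifted column is to read each label strictly by position. The sketch below assumes the standard 15-field KITTI order (type, truncated, occluded, alpha, four bbox values, then seven 3D fields); the helper name is hypothetical. Applied to the sample license_plate line posted earlier, it yields xmin = 0, because that line has four zeros before the coordinates where the cyclist example has three. That would be consistent with predicted boxes sitting on the left edge of the frame, although nothing in the thread confirms it as the cause.

```python
KITTI_FIELDS = (
    "type", "truncated", "occluded", "alpha",
    "xmin", "ymin", "xmax", "ymax",
    "height", "width", "length",
    "x", "y", "z", "rotation_y",
)


def parse_kitti_line(line):
    """Map one KITTI label line to named fields, strictly by position."""
    values = line.split()
    if len(values) != len(KITTI_FIELDS):
        raise ValueError(f"expected {len(KITTI_FIELDS)} fields, got {len(values)}")
    record = dict(zip(KITTI_FIELDS, values))
    # Every field after the class name is numeric; parse all as float for simplicity.
    for name in KITTI_FIELDS[1:]:
        record[name] = float(record[name])
    return record


def bbox(record):
    """Return (xmin, ymin, xmax, ymax) from a parsed record."""
    return tuple(record[name] for name in ("xmin", "ymin", "xmax", "ymax"))


# Both lines are copied from this thread.
sample = "license_plate 0 0 0 0 508.133331 58.074078 531.066650 71.555557 0 0 0 0 0 0"
reference = "cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00"

print("sample bbox:   ", bbox(parse_kitti_line(sample)))
print("reference bbox:", bbox(parse_kitti_line(reference)))
```

Note that both lines have 15 fields, so a simple field-count check would not catch the shift; only the parsed values reveal it.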

@neuroSparK
Any update, is the issue fixed on your side?
BTW, what is the average size of the license plate bboxes? Are they too small?
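
To put a number on the bbox size question, the widths and heights can be averaged over the label files. This is a minimal sketch with hypothetical helper names; it uses only the single box given in this thread, taking its four decimal values as xmin, ymin, xmax, ymax, so a real check would load every label in the dataset.

```python
from statistics import mean


def bbox_size(xmin, ymin, xmax, ymax):
    """Return (width, height) of a box in pixels."""
    return xmax - xmin, ymax - ymin


def average_size(boxes):
    """Average width and height over a list of (xmin, ymin, xmax, ymax) boxes."""
    sizes = [bbox_size(*box) for box in boxes]
    return mean(w for w, _ in sizes), mean(h for _, h in sizes)


# The one license plate box posted in this thread (1280x720 image).
boxes = [(508.133331, 58.074078, 531.066650, 71.555557)]
avg_w, avg_h = average_size(boxes)
print(f"average bbox: {avg_w:.1f} x {avg_h:.1f} px")
```

For that one sample the plate is roughly 23 x 13 pixels in a 1280x720 frame.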

The issue is not fixed yet. I couldn’t find a way forward or figure out what’s going wrong. Anyway, is there any way to find the pretrained model for fd_lpd.caffemodel found in the redaction example?

Hi @neuroSparK,
Could we focus on your incorrect tlt-infer result first?
I want to figure out why you get the wrong result with tlt-infer.
If possible, could you share the full training log? The latest training spec would also be appreciated.

I have found a small formatting error in the label fields. I will retrain with the correction and then check again.