Retrained model shows worse results compared to NVIDIA provided model

DeepStream 7.0, dGPU.

I retrained detectnet_v2 following this notebook: tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb at main · NVIDIA/tao_tutorials · GitHub, with the same KITTI dataset as described there.

I literally did not change any of the parameters; I just followed the notebook. The training was done on a Gigabyte laptop with a GeForce RTX 3060 and ran for about 7 hours, twice (once more for the pruned model).

I copied the resulting files

├── calibration.bin
├── labels.txt
├── nvinfer_config.txt
├── resnet18_detector.onnx

up to my AWS T4 instance and ran the inference there.
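
For context, these files came out of the export step at the end of the notebook; from memory it is roughly the following (the exact options, including whether an encryption key -k is required, depend on the TAO version, and the paths are the notebook's environment-variable placeholders):

tao model detectnet_v2 export \
    -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
    -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5 \
    -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.onnx \
    --gen_ds_config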

I managed to use the resulting files with my DS 7.0 app.

But the results are bad compared to the default NVIDIA resnet18_trafficcamnet.

Here is the configuration for the retrained model resnet18-detector:

[property]
gpu-id=0
net-scale-factor=0.00392156862745098
offsets=0;0;0
infer-dims=3;384;1248
tlt-model-key=tlt_encode
network-type=0
network-mode=2
labelfile-path=models/primary-detector/resnet18-detector/labels.txt
onnx-file=models/primary-detector/resnet18-detector/resnet18_detector.onnx
model-engine-file=models/primary-detector/resnet18-detector/resnet18_detector.onnx_b1_gpu0_fp16.engine
int8-calib-file=models/primary-detector/resnet18-detector/calibration.bin
batch-size=1
num-detected-classes=3
model-color-format=0
maintain-aspect-ratio=0
output-tensor-meta=0
cluster-mode=2
gie-unique-id=1
uff-input-order=0
#output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
uff-input-blob-name=input_1


[class-attrs-all]
pre-cluster-threshold=0.2
eps=0.4
group-threshold=1

The configuration for the NVIDIA-provided resnet18-trafficcamnet model:

################################################################################
# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

# Following properties are mandatory when engine files are not specified:
#   int8-calib-file(Only in INT8)
#   Caffemodel mandatory properties: model-file, proto-file, output-blob-names
#   UFF: uff-file, input-dims, uff-input-blob-name, output-blob-names
#   ONNX: onnx-file
#
# Mandatory properties for detectors:
#   num-detected-classes
#
# Optional properties for detectors:
#   cluster-mode(Default=Group Rectangles), interval(Primary mode only, Default=0)
#   custom-lib-path,
#   parse-bbox-func-name
#
# Mandatory properties for classifiers:
#   classifier-threshold, is-classifier
#
# Optional properties for classifiers:
#   classifier-async-mode(Secondary mode only, Default=false)
#
# Optional properties in secondary mode:
#   operate-on-gie-id(Default=0), operate-on-class-ids(Defaults to all classes),
#   input-object-min-width, input-object-min-height, input-object-max-width,
#   input-object-max-height
#
# Following properties are always recommended:
#   batch-size(Default=1)
#
# Other optional properties:
#   net-scale-factor(Default=1), network-mode(Default=0 i.e FP32),
#   model-color-format(Default=0 i.e. RGB) model-engine-file, labelfile-path,
#   mean-file, gie-unique-id(Default=0), offsets, process-mode (Default=1 i.e. primary),
#   custom-lib-path, network-mode(Default=0 i.e FP32)
#
# The values in the config file are overridden by values set through GObject
# properties.


# RESNET18-TRAFFICCAMNET model

[property]
gpu-id=0
net-scale-factor=0.00392156862745098
model-color-format=0
maintain-aspect-ratio=1
scaling-filter=0
scaling-compute-hw=0
tlt-model-key=tlt_encode
tlt-encoded-model=models/primary-detector/resnet18-trafficcamnet/resnet18_trafficcamnet.etlt
model-engine-file=models/primary-detector/resnet18-trafficcamnet/resnet18_trafficcamnet.etlt_b1_gpu0_fp16.engine
labelfile-path=models/primary-detector/resnet18-trafficcamnet/labels.txt
int8-calib-file=models/primary-detector/resnet18-trafficcamnet/cal_trt.bin
force-implicit-batch-dim=1
batch-size=1
process-mode=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=4
interval=0
gie-unique-id=1
uff-input-order=0
uff-input-blob-name=input_1
output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
cluster-mode=2
infer-dims=3;544;960
#operate-on-class-ids=2
#filter-out-class-ids=0;1;3;

[class-attrs-all]
pre-cluster-threshold=0.2
eps=0.4
group-threshold=1

I’m running inference on an FFmpeg feed, which the inference solution consumes via RTSP. I then fetch the annotated video back from the RTSP server. The video is an arbitrary Berlin street scene playing in a loop, which will be removed again once this problem is resolved.
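
For reference, the feed is pushed roughly like this (the file name and RTSP endpoint are placeholders, and the exact options may differ from what I actually run):

ffmpeg -re -stream_loop -1 -i berlin_street_scene.mp4 -c:v copy -an -f rtsp rtsp://<rtsp-server>:8554/input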

The results obtained with the retrained model resnet18-detector:

The results obtained with the original resnet18-trafficcamnet model:

Please note how much better the trafficcamnet detections look, especially for the green car coming from the left and one of the cyclists. The cars passing by are also detected much more smoothly…

Since you are training with the KITTI dataset, as mentioned in the notebook, you can split the KITTI dataset and run inference/evaluation against the held-out part of it (a rough example command is sketched below).
For a Berlin street scene, the KITTI dataset may have a different data distribution, so it is better to train on a Berlin street scene dataset and run inference on that.
Regarding the trafficcamnet pretrained model mentioned in the NGC model card, TrafficCamNet | NVIDIA NGC: TrafficCamNet v1.0 was trained on a proprietary dataset with more than 3 million objects for the car class alone, plus the other classes. That is why its detections look better on this kind of traffic footage.
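
As a rough sketch only (the exact command form, model file and key handling depend on your TAO version and directory layout; these paths are the notebook's placeholders), evaluating the retrained model on the KITTI validation split looks something like:

tao model detectnet_v2 evaluate \
    -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
    -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5

This should report per-class mAP on the validation fold defined in the spec, which tells you whether the training itself converged before you compare against a completely different scene.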

What do you mean by that?

I mean, KITTI is supposed to support autonomous driving, so it should be good enough even for that street scene…

Currently, fine-tuning trafficcamnet (which is based on the detectnet_v2 network) without forgetting is not supported. But as described in https://developer.nvidia.com/blog/training-custom-pretrained-models-using-tlt/, with a frozen convolutional layer the weights in that layer do not change during the loss update. This is especially helpful in transfer learning, where you can reuse the features provided by the pretrained weights and reduce training time. You can try freezing some layers (see the sketch below). Also, it is better to retrain with a Berlin street scene dataset.
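
For example, a minimal sketch of the model_config section of the training spec with the early ResNet blocks frozen (the path is a placeholder, the choice of blocks to freeze is up to you, and the remaining fields stay as in the default spec):

model_config {
  pretrained_model_file: "<path to the downloaded pretrained weights>"
  num_layers: 18
  arch: "resnet"
  # keep the weights of the first two residual blocks fixed during training
  freeze_blocks: 0
  freeze_blocks: 1
  # ... other fields unchanged from the default spec
}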

Repeated the training directly from the detectnet_v2 notebook on a T4 instance without any parameter change. The results are simply… not acceptable. This is utter bullshit; it cannot compete with either YOLO or trafficcamnet. You should really think about reworking this tutorial so that it at least reaches trafficcamnet accuracy, otherwise it is just a big disappointment after 24 hours of training.

Or at least lower the expectations by stating up front what can realistically be expected.

It also doesn't work even half as well as the other models on a NY street scene…

Can you share the training spec file? And what is the resolution of your training images?


Well, I just ran the notebook. What could I share that you don’t already have?

Do you mean you used the default spec file, which is the same as tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub?

Exactly.

Just to tell you where I stand: I have never dealt with training before. I’m a total newbie here. What other choice do I have than to follow your tutorials? They are by far not self-explanatory, so in order to have at least SOMETHING before going my own way, I was trying to “just” follow your stuff. Wrong?

But since you mention that: my final model only has three classes: car (everything is a car now), pedestrian and cyclist. No van, no sitting person.
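
If I read the default spec correctly, that is exactly what its dataset_config does: it folds the related KITTI labels into three target classes, roughly like this (quoted from memory, so the exact wording may differ):

target_class_mapping {
  key: "car"
  value: "car"
}
target_class_mapping {
  key: "van"
  value: "car"
}
target_class_mapping {
  key: "pedestrian"
  value: "pedestrian"
}
target_class_mapping {
  key: "person_sitting"
  value: "pedestrian"
}
target_class_mapping {
  key: "cyclist"
  value: "cyclist"
}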

If you follow the default notebook (tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb at main · NVIDIA/tao_tutorials · GitHub) and its spec file (tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub), please note that the pretrained model it downloads from NGC and points to in the spec is the general-purpose ResNet-18 backbone (pretrained_detectnet_v2:resnet18), not TrafficCamNet.

So, if you are using the default notebook and spec file, it is not using the trafficcamnet pretrained model.
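
For reference, the backbone the notebook pulls comes from this NGC download step (roughly, from memory; the destination path is the notebook's placeholder):

ngc registry model download-version nvidia/tao/pretrained_detectnet_v2:resnet18 --dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet18

The spec's pretrained_model_file then points at that downloaded file, not at the TrafficCamNet weights.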

I’m sure I know this by now. There was also nothing that said I would be doing trafficcamnet training, so I didn’t expect that from the beginning.

But since my app currently uses two different models - yolov7-tiny and trafficcamnet - and both perform WAY BETTER, I had the obviously wrong expectation that a costly training, which is praised and advertised as THE way to do things, would at least produce something usable.

It cannot, and it was wasted effort to try; that is what I know now. It would have been better to have known that beforehand.

Thanks anyway for your efforts to help me with both problems. But your TAO is a PITA…

The tutorial notebook shows end users how to train a detectnet_v2 network with the public KITTI dataset and a pretrained model from NGC.
Trafficcamnet is actually also trained with the detectnet_v2 network, but on a proprietary dataset with more than 3 million objects.

Yepp, you mentioned that. I did expect too much, obviously.