Issue with image classification tutorial and testing with deepstream-app

Description

Hi,

I am trying to train an image classification model following the NVIDIA TLT tutorial (Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation). My dataset contains 3 classes: Good, Leaked and Scratched.

I set up the spec file as described in the tutorial (Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation).

Here it is:

model_config {

  # Model architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet', 'darknet']

  arch: "resnet"
  
  # for resnet --> n_layers can be [10, 18, 34, 50, 101]
  # for vgg --> n_layers can be [16, 19]
  # for darknet --> n_layers can be [19, 53]


  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1

  # image size should be "3, X, Y", where X,Y >= 16
  input_image_size: "3,224,224"
}

eval_config {
  eval_dataset_path: "/workspace/testing-data"
  model_path: "/workspace/results/weights/resnet_008.tlt"
  top_k: 3
  batch_size: 8
  n_workers: 8
}

train_config {
  train_dataset_path: "/workspace/training-data"
  val_dataset_path: "/workspace/testing-data"
  #pretrained_model_path: "/path/to/your/pretrained/model"
  # optimizer can be chosen from ['adam', 'sgd']

  optimizer: "sgd"
  batch_size_per_gpu: 8
  n_epochs: 8
  n_workers: 16

  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }

  # learning_rate

  lr_config {

    # "step" and "soft_anneal" are supported.

    scheduler: "soft_anneal"

    # "soft_anneal" stands for soft annealing learning rate scheduler.
    # the following 4 parameters should be specified if "soft_anneal" is used.
    learning_rate: 0.005
    soft_start: 0.056
    annealing_points: "0.3, 0.6, 0.8"
    annealing_divider: 10
    # "step" stands for step learning rate scheduler.
    # the following 3 parameters should be specified if "step" is used.
    # learning_rate: 0.006
    # step_size: 10
    # gamma: 0.1

    # "cosine" stands for soft start cosine learning rate scheduler.
    # the following 2 parameters should be specified if "cosine" is used.
    # learning_rate: 0.05
    # soft_start: 0.01
  }
}
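
For reference, this spec is consumed by the tutorial's training step, roughly like this (a sketch; the spec filename and key are placeholders, and the flags are the standard ones from the TLT 2.0 classification workflow):

tlt-train classification -e ./classification_spec.cfg -r /workspace/results -k <my pass>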

I resized my training and testing images to 244x244.

Then I just followed the tutorial to train, evaluate and run inference (using tlt-infer). Interestingly, once I got my tlt model and pointed tlt-infer at various test images, I got the correct labels back. Here is the tail of the output produced by tlt-infer:

avg_pool (AveragePooling2D)     (None, 512, 1, 1)    0           block_4b_relu[0][0]              
__________________________________________________________________________________________________
flatten (Flatten)               (None, 512)          0           avg_pool[0][0]                   
__________________________________________________________________________________________________
predictions (Dense)             (None, 3)            1539        flatten[0][0]                    
==================================================================================================
Total params: 11,549,827
Trainable params: 11,372,675
Non-trainable params: 177,152
__________________________________________________________________________________________________
2021-01-13 14:13:49,186 [INFO] iva.makenet.scripts.inference: Processing ./testing-data/Good2/Good_2020_06_29__12_17_42_148169.jpg...
2021-01-13 14:13:49.609370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-13 14:13:49.888716: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Current predictions: [[9.9969721e-01 1.0639867e-05 2.9223689e-04]]
Class label = 0
Class name = Good2

Then I exported the model to the .etlt format to be used on the Jetson Nano:

tlt-export classification -m ./results/weights/resnet_008.tlt -k <my pass> --data_type fp16 -o ./resnet_008.etlt
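
(For reference, the exported .etlt can also be converted to a TensorRT engine directly on the Nano with tlt-converter, instead of letting deepstream-app build it on first run; a sketch, assuming the Jetson tlt-converter binary and the same key and input dimensions as above:)

./tlt-converter -k <my pass> -o predictions/Softmax -d 3,224,224 -i nchw -t fp16 -m 1 -e resnet_008.etlt_b1_gpu0_fp16.engine resnet_008.etlt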

I also created an mp4 video from my test image set (Good, Leaked and Scratched) so I could feed it into deepstream-app.
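
(For reference, such a video can be assembled from the still images with ffmpeg along these lines; the frame rate and paths are placeholders:)

ffmpeg -framerate 2 -pattern_type glob -i './testing-data/*/*.jpg' -pix_fmt yuv420p -c:v libx264 ../video/testing.mp4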

Then I created the config files for deepstream-app (again as per the instructions: Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation).

Here is my version:

[property]
gpu-id=0
net-scale-factor=1
offsets=123.675;116.28;103.53
#model-color-format=1
batch-size=1

tlt-model-key=<my pass>
tlt-encoded-model=../models/resnet_008.etlt
model-engine-file=../models/resnet_008.etlt_b1_gpu0_fp16.engine

labelfile-path=labels.txt

infer-dims=3;224;224
uff-input-blob-name=input_1

## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2

process-mode=1
interval=0
network-type=1
gie-unique-id=1
output-blob-names=predictions/Softmax
classifier-threshold=0.6

One thing to note about the labels example in the tutorial: it seems there is an error. The categories in the labels file should be listed one after another, separated by ';', not one per line as instructed in the tutorial. So my labels file just contains: Good;Leaked;Scratched;

Also, I configured deepstream-app to output the result overlay via RTSP. Here is my config file for the app:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=3
uri=file://../video/testing.mp4
num-sources=8
#drop-frame-interval=2
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
cudadec-memtype=0

[sink2]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming
type=4
#1=h264 2=h265
codec=1
#encoder type 0=Hardware 1=Software
enc-type=0
sync=0
bitrate=4000000
#H264 Profile - 0=Baseline 2=Main 4=High
#H265 Profile - 0=Main 1=Main10
profile=0
# set below properties in case of RTSPStreaming
rtsp-port=8554
udp-port=5400

[osd]
enable=1
gpu-id=0
border-width=1
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=1
batch-size=8
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=1280
height=768
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0
## If set to TRUE, system timestamp will be attached as ntp timestamp
## If set to FALSE, ntp timestamp from rtspsrc, if available, will be attached
# attach-sys-ts-as-ntp=1

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
nvbuf-memory-type=0
config-file=config_infer.txt

Finally I ran deepstream-app with the specified config file; the engine file got built, the app performed inference, and the result overlay could be viewed with ffplay via RTSP. Unfortunately, this is where the model stops working. For some reason I only ever get the 'Leaked' label displayed.

I have played with classifier-threshold. Initially I had it at 0.2. If I set it to 0.8, the 'Leaked' label sometimes disappeared. When I set net-scale-factor to 0.5 I got the Leaked and Scratch labels displayed, but they weren't related to what was on screen at that moment.

I have tried increasing the number of epochs to 30 during training, as well as pruning and retraining, but it didn't make any difference.

So where did I make a mistake in following the tutorial?

Environment

Training machine with GeForce GTX 1650:
TensorFlow: 1.15
CUDA: 10
Training docker container: nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

Deployment Jetson Nano:
JetPack 4.4
TensorRT 7.1.3 + CUDA 10.2

Hi @dzmitry.babrovich
First, could you please run inference (via tlt-infer) against a set of test images?

Hi @Morganh,
As I mentioned above, I did run tlt-infer against a set of test images; I included the output it produced for an image from the Good set. I'll try again and post more results from the different categories of the test set.

Hi @Morganh

So I have selected a few example images from the testing set, across the different categories; here are the results:

Command:
tlt-infer classification -m ./results/weights/resnet_008.tlt -k -cm ./results/classmap.json -i ./testing-data/Good2/Good_2020_06_29__12_17_42_148169.jpg
Result:
Current predictions: [[9.9862862e-01 6.6816618e-05 1.3046165e-03]]
Class label = 0
Class name = Good2

Command:
tlt-infer classification -m ./results/weights/resnet_008.tlt -k -cm ./results/classmap.json -i ./testing-data/Scratch2/Scratch_2020_07_06__11_32_26_253208.jpg
Result:
Current predictions: [[1.0612786e-06 5.9442841e-06 9.9999297e-01]]
Class label = 2
Class name = Scratch2

Command:
tlt-infer classification -m ./results/weights/resnet_008.tlt -k -cm ./results/classmap.json -i ./testing-data/Leakage2/Leaker_2020_07_02__17_28_58_504982.jpg
Result:
Current predictions: [[1.5853602e-05 9.9969363e-01 2.9050530e-04]]
Class label = 1
Class name = Leakage2

So the trained tlt model does seem to work. I understand that this selection is not representative, but it gives me some confidence that the trained model can classify images from the different categories.

Thanks for the info. So your tlt-infer works well.
BTW, you can run inference in directory mode to process a whole set of test images.
According to the Jupyter notebook, the sample command is as below:

tlt-infer classification -m $USER_EXPERIMENT_DIR/output_retrain/weights/resnet_$EPOCH.tlt \
  -k $KEY -b 32 -d $DATA_DOWNLOAD_DIR/split/test/person \
  -cm $USER_EXPERIMENT_DIR/output_retrain/classmap.json

The inference result will be saved in test/person/result.csv

Hi @Morganh

Thanks for the hint. I've run the inference command against my testing images as per your advice and the results look good to me. Please see the attached files. I issued the tlt-infer command against each category directory (hence 3 files): result_original_Good.csv (87.5 KB), result_original_Leakage.csv (64.1 KB), result_original_Scratch.csv (84.4 KB)

Please refer to

How about running an mp4 file instead of rtsp?
Could you provide more details about "Unfortunately this is where the model stops working"? Any log?

Firstly, please run the example below and make sure it can run.

Then try the config with your rtsp instead of the mp4 file, and make sure it can run.

If the above works, please check your deepstream config again, along with the inference config file.
In your inference config file, please modify:
infer-dims=3;224;224 to infer-dims=3;224;224;0
net-scale-factor=1 to net-scale-factor=1.0

Also, there are actually two ways of running a classification model with deepstream:

  1. Run as primary gie
  2. Run as secondary gie

If set as primary gie,
please set process-mode=1 in the inference config file.

If set as secondary gie,
please set process-mode=2 in the inference config file.
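
For context, in the deepstream-app config these two modes map to different sections. A minimal sketch based on the stock sample configs (the secondary section name and operate-on-gie-id follow the standard deepstream-app samples; the gie-unique-id values and the secondary config filename here are assumptions):

# classifier as primary gie: full-frame classification
# (its nvinfer config has process-mode=1)
[primary-gie]
enable=1
gie-unique-id=1
config-file=config_infer.txt

# classifier as secondary gie: classifies objects produced by an upstream detector
# (its nvinfer config has process-mode=2; the filename is a placeholder)
[secondary-gie0]
enable=1
gie-unique-id=2
operate-on-gie-id=1
config-file=config_infer_secondary.txt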

Hi @Morganh,

Changing infer-dims=3;224;224 to infer-dims=3;224;224;0 produces the error:

Error. 'infer-dims' array length is 4. Should be 3 as [c;h;w] order.
Failed to parse group property

Did you mean input-dims=3;224;224;0?

I did that and it didn’t make any difference.

In the meantime I have been trying a different path: training a TensorFlow model and converting it to ONNX (Transfer learning and fine-tuning | TensorFlow Core). I managed to do it using ResNet50V2 as a base (ResNet50 didn't work, since it had to be converted to ONNX with opset 10, which the Jetson Nano's TensorRT doesn't support), and it all worked; the conversion command is sketched after the config below. I used this config file:

[property]
gpu-id=0
batch-size=1

labelfile-path=labels.txt
onnx-file=../models/fine_model_resnet50v2.onnx
model-engine-file=../models/fine_model_resnet50v2.onnx_b1_gpu0_fp16.engine

## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2

process-mode=1
interval=0
network-type=1
gie-unique-id=1
classifier-threshold=0.8
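
The ONNX conversion itself was along these lines (a sketch, assuming a TensorFlow SavedModel export and the standard tf2onnx CLI; the paths and opset value are placeholders):

python -m tf2onnx.convert --saved-model ./fine_model_resnet50v2_saved --opset 11 --output ../models/fine_model_resnet50v2.onnx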

I am going to give NVIDIA's tutorial another go and train a ResNet model with 50 layers to see if it makes any difference. It does take a lot of time, though, to train on about 7000 images.

I think you can close this ticket since I found a different solution to my problem, but I do think the tutorial needs polishing.

Yes, it is input-dims=3;224;224;0

Glad to see that you have the solution now.

Actually, when you deploy the tlt model via the config file, it should work.
For example, if you trained a two-class (person and another class) model with the TLT classification network, then you can run inference in deepstream in the two ways below.

  1. Work as primary trt engine
    ds_classification_as_primary_gie (3.4 KB)
    config_as_primary_gie.txt (741 Bytes)

nvidia@nvidia:/opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app$ deepstream-app -c ds_classification_as_primary_gie

  2. Work as secondary trt engine
    ds_classification_as_secondary_gie (3.6 KB)
    config_as_secondary_gie.txt (741 Bytes)

nvidia@nvidia:/opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app$ deepstream-app -c ds_classification_as_secondary_gie

Yes, as I said, I configured the three-class model as the primary trt engine and it didn't work for me. I only get one label displayed, irrespective of the image, as I described above.

Please refer to the way I mentioned above.
Note that the two lines below are not changed:

net-scale-factor=1.0
offsets=123.67;116.28;103.53

And please add the line below to your infer spec:

num-detected-classes=3
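
Putting these suggestions together, the [property] section of the infer config posted earlier would then look roughly like this (a sketch: only net-scale-factor, input-dims and num-detected-classes change, everything else is carried over unchanged):

[property]
gpu-id=0
net-scale-factor=1.0
offsets=123.675;116.28;103.53
#model-color-format=1
batch-size=1
tlt-model-key=<my pass>
tlt-encoded-model=../models/resnet_008.etlt
model-engine-file=../models/resnet_008.etlt_b1_gpu0_fp16.engine
labelfile-path=labels.txt
input-dims=3;224;224;0
uff-input-blob-name=input_1
network-mode=2
process-mode=1
interval=0
network-type=1
gie-unique-id=1
num-detected-classes=3
output-blob-names=predictions/Softmax
classifier-threshold=0.6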

Hi @Morganh ,
I could integrate and run my classification model with deepstream, but the classified outputs are wrong in deepstream. With tlt-infer and a standalone Python program, the classification is correct. Do I need to change some parameters in the config file to get the results right?

Do I need to do some RGB to BGR conversion in the configuration file? @Morganh

The model-color-format should be "1" for a BGR configuration.

model-color-format = 1

You can also try changing the line below, to check if it helps (these look like the ImageNet per-channel means in BGR order; nvinfer subtracts the offsets from each channel before applying net-scale-factor, so they need to match the training-time preprocessing):

offsets=123.67;116.28;103.53

to

offsets=103.939;116.779;123.68

Thanks @Morganh ,
Yeah, my model-color-format is set to 1, and I have replaced the previous offsets with 103.939;116.779;123.68, but still all the frames that are supposed to belong to the positive class are predicted as the negative class…

Thanks for the info. I will check further.


@jazeel.jk
As mentioned above, please modify to

offsets=103.939;116.779;123.68
model-color-format=1

I confirm that it gets the same result as tlt-infer.
Working as the primary trt engine:
ds_classification_as_primary_gie (3.4 KB)
config_as_primary_gie (3).txt (743 Bytes)

Also, please double-check your label file.
Yours should be:

negative;positive

@Morganh ,
I checked it. I used the same ds_classification_as_primary_gie config file for deepstream-app, and the label file is in the order of classmap.json, meaning the first one is negative and the second one is positive.
Is there a way to get the predicted outputs printed on the terminal?

With the steps I mentioned above, the predicted output will be shown in the top-left corner of the monitor.
For other ways, please search or ask in the DeepStream forum.