FPENet inference confidence seems low

Note: I tried to move this topic to this forum from here but couldn't, so this is a copy. (The DeepStream forum didn't know how to answer it.)

When running the FPENet DeepStream example application on a camera input, the confidence on each landmark stays between roughly 0.2 and 0.35. When the facial landmarks look very accurate it rises toward 0.3; when I turn to profile and the hidden points scatter all over the place, it drops to 0.2.

Is this expected, or is something wrong? I was expecting the values to span the full 0.0 to 1.0 range.

ezgif.com-gif-maker (animated GIF)
(The orange bar on the left indicates the average confidence; ignore the blue graph.)

Environment

 ./jetsonInfo.py
NVIDIA Jetson UNKNOWN
 L4T 32.7.2 [ JetPack UNKNOWN ]
   Ubuntu 18.04.6 LTS
   Kernel Version: 4.9.253-tegra
 CUDA 10.2.300
   CUDA Architecture: NONE
 OpenCV version: 3.2.0
   OpenCV Cuda: NO
 CUDNN: 8.2.1.32
 TensorRT: 8.2.1.8
 Vision Works: 1.6.0.501
 VPI: ii libnvvpi1 1.2.3 arm64 NVIDIA Vision Programming Interface library
 Vulcan: 1.2.70


Steps To Reproduce

Compile and run deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app from the release/tao3.0_ds6.0.1 branch of NVIDIA-AI-IOT/deepstream_tao_apps on GitHub,
and print out the retrieved confidence in deepstream_faciallandmark_app.cpp (same branch).

The confidence doesn't seem to have a very big range (0.2-0.35).

Could you please share the test video or jpeg file?

BTW, did you ever try another test video or JPEG file? For example, you can try the dataset mentioned in the Jupyter notebook.


Note: I'm using the release/tao3.0_ds6.0.1 branch of the https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps.git repo, as that is the latest I can use on our hardware.

I start it up using testvideo.sh:
testvideo.sh (266 Bytes)

I've made changes to print out the confidence:
deepstream_faciallandmark_app.cpp (36.0 KB)
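
The relevant change is essentially the following (a minimal sketch of what I added, not a verbatim copy of the attached file; it assumes output-tensor-meta is enabled on the SGIE so the probe sees NvDsInferTensorMeta on each object):

/* Sketch: walk the object user meta attached by the SGIE and print
 * the 80 per-landmark scores from the "softargmax:1" output layer. */
static void
print_landmark_scores (NvDsObjectMeta * obj_meta)
{
  for (NvDsMetaList * l = obj_meta->obj_user_meta_list; l != NULL;
      l = l->next) {
    NvDsUserMeta *user_meta = (NvDsUserMeta *) l->data;
    if (user_meta->base_meta.meta_type != NVDSINFER_TENSOR_OUTPUT_META)
      continue;
    NvDsInferTensorMeta *meta =
        (NvDsInferTensorMeta *) user_meta->user_meta_data;
    for (unsigned int i = 0; i < meta->num_output_layers; i++) {
      NvDsInferLayerInfo *layer = &meta->output_layers_info[i];
      if (g_strcmp0 (layer->layerName, "softargmax:1") != 0)
        continue;
      float *scores = (float *) meta->out_buf_ptrs_host[i];
      for (int k = 0; k < 80; k++)
        g_print ("score: [%d] score: %f\n", k, scores[k]);
    }
  }
}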

This is the output for the first frame (the other frames are very much in the same range):

< tao_others/deepstream-faciallandmark-app + release/tao3.0_ds6.0.1 > ./testvideo.sh
+ export LD_LIBRARY_PATH=:/opt/nvidia/deepstream/deepstream/lib/cvcore_libs
+ pwd
+ currentdir=/data/src/taoapps/deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app
+ ./deepstream-faciallandmark-app 2 ../../../configs/facial_tao/sample_faciallandmarks_config.txt file:///data/src/taoapps/deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app/test.mp4 ./landmarks
Request sink_0 pad from streammux
Now playing: file:///data/src/taoapps/deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app/test.mp4
0:00:05.724243834 16898   0x5594a81930 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<second-infer-engine1> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1900> [UID = 2]: deserialized trt engine from :/data/src/taoapps/deepstream_tao_apps/models/faciallandmark/faciallandmarks.etlt_b32_gpu0_fp16.engine
INFO: [FullDims Engine Info]: layers num: 4
0   INPUT  kFLOAT input_face_images 1x80x80         min: 1x1x80x80       opt: 32x1x80x80      Max: 32x1x80x80
1   OUTPUT kFLOAT conv_keypoints_m80 80x80x80        min: 0               opt: 0               Max: 0
2   OUTPUT kFLOAT softargmax      80x2            min: 0               opt: 0               Max: 0
3   OUTPUT kFLOAT softargmax:1    80              min: 0               opt: 0               Max: 0

ERROR: [TRT]: 3: Cannot find binding of given name: softargmax,softargmax:1,conv_keypoints_m80
0:00:05.725646257 16898   0x5594a81930 WARN                 nvinfer gstnvinfer.cpp:635:gst_nvinfer_logger:<second-infer-engine1> NvDsInferContext[UID 2]: Warning from NvDsInferContextImpl::checkBackendParams() <nvdsinfer_context_impl.cpp:1868> [UID = 2]: Could not find output layer 'softargmax,softargmax:1,conv_keypoints_m80' in engine
0:00:05.725689969 16898   0x5594a81930 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<second-infer-engine1> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2004> [UID = 2]: Use deserialized engine model: /data/src/taoapps/deepstream_tao_apps/models/faciallandmark/faciallandmarks.etlt_b32_gpu0_fp16.engine
0:00:06.305827043 16898   0x5594a81930 INFO                 nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus:<second-infer-engine1> [UID 2]: Load new model:../../../configs/facial_tao/faciallandmark_sgie_config.txt sucessfully
0:00:06.306262305 16898   0x5594a81930 WARN                 nvinfer gstnvinfer.cpp:635:gst_nvinfer_logger:<primary-infer-engine1> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1161> [UID = 1]: Warning, OpenCV has been deprecated. Using NMS for clustering instead of cv::groupRectangles with topK = 20 and NMS Threshold = 0.5
0:00:06.825332491 16898   0x5594a81930 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-infer-engine1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1900> [UID = 1]: deserialized trt engine from :/data/src/taoapps/deepstream_tao_apps/models/faciallandmark/facenet.etlt_b1_gpu0_fp16.engine
INFO: [Implicit Engine Info]: layers num: 3
0   INPUT  kFLOAT input_1         3x416x736
1   OUTPUT kFLOAT output_bbox/BiasAdd 4x26x46
2   OUTPUT kFLOAT output_cov/Sigmoid 1x26x46

0:00:06.828065434 16898   0x5594a81930 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-infer-engine1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2004> [UID = 1]: Use deserialized engine model: /data/src/taoapps/deepstream_tao_apps/models/faciallandmark/facenet.etlt_b1_gpu0_fp16.engine
0:00:06.851383402 16898   0x5594a81930 INFO                 nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus:<primary-infer-engine1> [UID 1]: Load new model:../../../configs/facial_tao/config_infer_primary_facenet.txt sucessfully
Decodebin child added: source
Decodebin child added: decodebin0
Running...
Decodebin child added: qtdemux0
Decodebin child added: multiqueue0
Decodebin child added: h264parse0
Decodebin child added: capsfilter0
Decodebin child added: nvv4l2decoder0
Opening in BLOCKING MODE
NvMMLiteOpen : Block : BlockType = 261
NVMEDIA: Reading vendor.tegra.display-size : status: 6
NvMMLiteBlockCreate : Block : BlockType = 261
In cb_newpad
###Decodebin pick nvidia decoder plugin.
score: [0] score: 0.134399
score: [1] score: 0.113953
score: [2] score: 0.124878
score: [3] score: 0.124268
score: [4] score: 0.0649414
score: [5] score: 0.0767212
score: [6] score: 0.128296
score: [7] score: 0.364502
score: [8] score: 0.393311
score: [9] score: 0.255615
score: [10] score: 0.153687
score: [11] score: 0.0911255
score: [12] score: 0.180054
score: [13] score: 0.199707
score: [14] score: 0.170898
score: [15] score: 0.106567
score: [16] score: 0.14624
score: [17] score: 0.443359
score: [18] score: 0.513672
score: [19] score: 0.226807
score: [20] score: 0.269287
score: [21] score: 0.282959
score: [22] score: 0.207153
score: [23] score: 0.21936
score: [24] score: 0.252441
score: [25] score: 0.179932
score: [26] score: 0.308594
score: [27] score: 0.405762
score: [28] score: 0.217651
score: [29] score: 0.217163
score: [30] score: 0.280273
score: [31] score: 0.348633
score: [32] score: 0.430908
score: [33] score: 0.176392
score: [34] score: 0.253174
score: [35] score: 0.231445
score: [36] score: 0.316406
score: [37] score: 0.351807
score: [38] score: 0.383545
score: [39] score: 0.307129
score: [40] score: 0.348633
score: [41] score: 0.337402
score: [42] score: 0.162354
score: [43] score: 0.165649
score: [44] score: 0.151611
score: [45] score: 0.144409
score: [46] score: 0.174438
score: [47] score: 0.171753
score: [48] score: 0.227295
score: [49] score: 0.188232
score: [50] score: 0.197021
score: [51] score: 0.378906
score: [52] score: 0.237427
score: [53] score: 0.30127
score: [54] score: 0.317383
score: [55] score: 0.355713
score: [56] score: 0.333984
score: [57] score: 0.294678
score: [58] score: 0.332764
score: [59] score: 0.36084
score: [60] score: 0.385498
score: [61] score: 0.327637
score: [62] score: 0.349854
score: [63] score: 0.248779
score: [64] score: 0.238525
score: [65] score: 0.315918
score: [66] score: 0.480713
score: [67] score: 0.380371
score: [68] score: 0.309814
score: [69] score: 0.304688
score: [70] score: 0.426025
score: [71] score: 0.269775
score: [72] score: 0.182373
score: [73] score: 0.112
score: [74] score: 0.144897
score: [75] score: 0.190063
score: [76] score: 0.342529
score: [77] score: 0.586914
score: [78] score: 0.0949707
score: [79] score: 0.104126

No, I didn't use any of the notebook test videos. Do you have a link to them? And an indication of what kind of confidence is expected from that video?

Thanks for the info. I will check with your video.

For the FPENet notebook, please refer to
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#computer-vision

You can download it via
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.4.1/zip -O cv_samples_v1.4.1.zip
unzip -u cv_samples_v1.4.1.zip -d ./cv_samples_v1.4.1 && rm -rf cv_samples_v1.4.1.zip && cd ./cv_samples_v1.4.1

Or refer to
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/cv_samples/version/v1.4.1/files/fpenet/fpenet.ipynb#head-2

The dataset images are here:
https://ibug.doc.ic.ac.uk/download/annotations/afw.zip/

You can use ffmpeg to generate a video file based on the image files.
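
For example, something along these lines should work (the exact flags are only a suggestion; adjust the frame rate and output size to your images):

ffmpeg -framerate 2 -pattern_type glob -i '*.jpg' -vf scale=1280:720 -pix_fmt yuv420p test.mp4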

Thank you; please let me know how you go.

I've downloaded the afw.zip images, turned them into an mp4, and am getting similar output:

score: [3] score: 0.207153
score: [4] score: 0.0524292
score: [5] score: 0.0820923
score: [6] score: 0.0934448
score: [7] score: 0.182495
score: [8] score: 0.227783
score: [9] score: 0.26001
score: [10] score: 0.185791
score: [11] score: 0.10199
score: [12] score: 0.09375
score: [13] score: 0.0924072
score: [14] score: 0.122253
score: [15] score: 0.124023
score: [16] score: 0.208984
score: [17] score: 0.224609
score: [18] score: 0.144897
score: [19] score: 0.120667
score: [20] score: 0.232544
score: [21] score: 0.209473

With the highest sequence looking like:

score: [69] score: 0.361816
score: [70] score: 0.610352
score: [71] score: 0.402588
score: [72] score: 0.517578
score: [73] score: 0.555176
score: [74] score: 0.577637
score: [75] score: 0.672852

But 90% of the time it's like:

re: [9] score: 0.446289
score: [10] score: 0.30127
score: [11] score: 0.0866699
score: [12] score: 0.0640869
score: [13] score: 0.119507
score: [14] score: 0.103333
score: [15] score: 0.112976
score: [16] score: 0.155151
score: [17] score: 0.467529
score: [18] score: 0.270996
score: [19] score: 0.159546
score: [20] score: 0.156494
score: [21] score: 0.216187
Is this also what you see?
Or is there something wrong with my hardware?
I did notice a few warning messages in the output:
ERROR: [TRT]: 3: Cannot find binding of given name: softargmax,softargmax:1,conv_keypoints_m80
0:00:05.725646257 16898   0x5594a81930 WARN                 nvinfer gstnvinfer.cpp:635:gst_nvinfer_logger:<second-infer-engine1> NvDsInferContext[UID 2]: Warning from NvDsInferContextImpl::checkBackendParams() <nvdsinfer_context_impl.cpp:1868> [UID = 2]: Could not find output layer 'softargmax,softargmax:1,conv_keypoints_m80' in engine

....

0:00:06.306262305 16898   0x5594a81930 WARN                 nvinfer gstnvinfer.cpp:635:gst_nvinfer_logger:<primary-infer-engine1> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1161> [UID = 1]: Warning, OpenCV has been deprecated. Using NMS for clustering instead of cv::groupRectangles with topK = 20 and NMS Threshold = 0.5

Do you see the same?

The reason I ask is that I would like to know whether I can depend on the facial landmarks. We do some calculations afterwards, and if the source points are 'uncertain' beyond a particular degree, we may halt the calculations for a bit until the points are known to be good. But it doesn't look like I can use the scores if they stay low even when the points appear to follow the eyes/nose/mouth quite accurately.

Even if we re-train to get a better model, we would still need to be able to depend on the confidence output; the kind of gating I have in mind is sketched below.
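
Roughly (a hypothetical sketch of our own gating logic; the 0.3 threshold is a placeholder we would tune, not a value from NVIDIA):

/* Hypothetical downstream gating: average the 80 per-landmark scores
 * and skip our own calculations while the average stays below a
 * tunable threshold. */
static gboolean
landmarks_trustworthy (const float *scores, int n, float threshold)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += scores[i];
  return (sum / n) >= threshold;  /* e.g. n = 80, threshold = 0.3f */
}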

Also:
What is the output of the conv_keypoints_m80 layer?
The model card says it is 80x80x80, but I'm having some difficulty understanding what that means.
Is it some sort of confidence per pixel in the image, per landmark? What does it indicate? Is it a float?

It is not the confidence. The model’s target resolution is 80x80. Refer to Facial Landmarks Estimation - NVIDIA Docs
height: 80
width: 80
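
Judging from the engine info above (conv_keypoints_m80 is 80x80x80, softargmax is 80x2, softargmax:1 is 80), that tensor is presumably 80 float heatmaps, one per landmark, at the 80x80 input resolution; the softargmax layers then reduce each heatmap to an (x, y) coordinate plus a score. As an illustration only (the general technique, not FPENet's exact layer), a soft-argmax over one 80x80 heatmap could look like this:

#include <math.h>

/* Illustrative soft-argmax over a single heatmap: softmax-normalize
 * the map, take the probability-weighted mean of the pixel
 * coordinates, and use the peak probability as a score. */
typedef struct { float x, y, score; } Keypoint;

static Keypoint
soft_argmax (const float *heatmap, int w, int h)
{
  float max_logit = heatmap[0];
  for (int i = 1; i < w * h; i++)
    if (heatmap[i] > max_logit)
      max_logit = heatmap[i];

  double sum = 0.0, ex = 0.0, ey = 0.0, peak = 0.0;
  for (int y = 0; y < h; y++) {
    for (int x = 0; x < w; x++) {
      double p = exp ((double) heatmap[y * w + x] - max_logit);
      sum += p;
      ex += p * x;
      ey += p * y;
      if (p > peak)
        peak = p;
    }
  }
  Keypoint kp = { (float) (ex / sum), (float) (ey / sum),
                  (float) (peak / sum) };
  return kp;
}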

I am still checking. Will update if any.

Also, may I know which model you downloaded? Is it Facial Landmarks Estimation | NVIDIA NGC,
$ wget 'https://api.ngc.nvidia.com/v2/models/nvidia/tao/fpenet/versions/deployable_v3.0/files/model.etlt' ? Or the 2.0 or 1.0 version?

I downloaded the one fetched by download_models.sh in deepstream_tao_apps (tao3.0_ds6.0.1 branch).

Inside the script it does:

echo "==================================================================="
echo "begin downloading facial landmarks model "
echo "==================================================================="
mkdir -p ./models/faciallandmark
cd ./models/faciallandmark
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/fpenet/versions/deployable_v3.0/files/model.etlt -O faciallandmarks.etlt
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/fpenet/versions/deployable_v3.0/files/int8_calibration.txt -O fpenet_cal.txt
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/facenet/versions/pruned_quantized_v2.0/files/model.etlt -O facenet.etlt
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/facenet/versions/pruned_quantized_v2.0/files/int8_calibration.txt -O int8_calibration.txt

I ran the release/tao3.0_ds6.0.1 branch with your test files. The confidence result is different from yours; it has a bigger range.
I attach the log for reference.
I am running on an A6000 device.
CUDA 11.4
DeepStream 6.0
235681_result.txt (12.1 MB)

Hello @Morganh,

Thank you for going to the effort of trying to reproduce this! I really appreciate it.
Yes, it looks like it has a bigger range on your machine.

What could cause such a difference with the same video input file, configuration and model?

Are you able to run it on the same machine and DeepStream version (6.0.1)?

I noticed that you also get the Could not find output layer 'softargmax,softargmax:1,conv_keypoints_m80' in engine warning, so that is good to see; it may not be the problem.

How do we figure out what is going on here?

Kind regards,
Tom

The other thing that caught my attention during my first run (if I delete the engine files so they get re-generated):

0:00:02.674259239 31830   0x5565aa62d0 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<second-infer-engine1> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 2]: Trying to create engine from model files
WARNING: INT8 not supported by platform. Trying FP16 mode.
WARNING: [TRT]: onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: [TRT]: onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
WARNING: INT8 not supported by platform. Trying FP16 mode.
WARNING: [TRT]: DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU

Which would be different on your A6000 machine?

The log output is very confusing. First it says that INT8 is not supported on my platform (so I assume that is what was requested for the model?) and that it will try FP16. Then I get a message that the model was actually generated with INT64 weights and needs to be cast down to INT32.

INT8, FP16, INT64 and INT32, so I'm confused about what is going on.

Is there a facial landmarks model I can test on the Jetson TX2 NX that won't do the 'clamping'? Maybe that is causing the low confidence?

I'm just guessing here.

May I know which Jetson device you are running on? TX2, NX, or Xavier?

It’s a TX2 NX

OK, I will check further in a Jetson device.


Could you try changing to fp32 or fp16 mode (network-mode=0 or network-mode=2) in the config file and run again? It will then run with a generated fp32 or fp16 TensorRT engine.

Running with fp32 (network-mode=0), it doesn't start at all:

..
..
..
l_tao/../../models/faciallandmark/facenet.etlt_b1_gpu0_fp16.engine failed, try rebuild
0:01:04.959363539  8019   0x556a8ef4d0 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-infer-engine1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
WARNING: INT8 not supported by platform. Trying FP16 mode.
WARNING: INT8 not supported by platform. Trying FP16 mode.
WARNING: [TRT]: Tactic Device request: 403MB Available: 289MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 403 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 403MB Available: 292MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 403 detected for tactic 4.
Killed

Running with fp16 (network-mode=2), it starts up with some of these:

WARNING: INT8 not supported by platform. Trying FP16 mode.
WARNING: [TRT]: Tactic Device request: 403MB Available: 238MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 403 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 403MB Available: 239MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 403 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 279MB Available: 238MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 279 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 278MB Available: 238MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 278 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 272MB Available: 238MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 272 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 272MB Available: 238MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 272 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 345MB Available: 237MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 345 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 345MB Available: 237MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 345 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 247MB Available: 237MB. Device memory is insufficient to use tactic.
WARNING: [TRT]: Skipping tactic 3 due to insuficient memory on requested size of 247 detected for tactic 4.
WARNING: [TRT]: Tactic Device request: 246MB Available: 238MB. Device memory is insufficient to use tactic.

It runs for a while with a much larger range of confidences (0.0-0.8), but it takes 80% CPU, and towards the end it must need too much memory for the device, as it also gets killed.

OK, I stopped the UI (X windows) to free some memory, and fp16 finished without crashing:

9] score: 0.349121
score: [0] score: 0.140869
score: [1] score: 0.0931396
score: [2] score: 0.171143
score: [3] score: 0.133301
score: [4] score: 0.130859
score: [5] score: 0.272461
score: [6] score: 0.313232
score: [7] score: 0.407959
score: [8] score: 0.470215
score: [9] score: 0.330566
score: [10] score: 0.138428
score: [11] score: 0.0700073
score: [12] score: 0.0723877
score: [13] score: 0.102722
score: [14] score: 0.0836792
score: [15] score: 0.0823364
score: [16] score: 0.0958862
score: [17] score: 0.30127
score: [18] score: 0.201416
score: [19] score: 0.174072
score: [20] score: 0.182617
score: [21] score: 0.334961
score: [22] score: 0.259277
score: [23] score: 0.331299
score: [24] score: 0.172974
score: [25] score: 0.146729

There is a bit more range in the score output, but still a lot less than I expected. From the A6000 output file you sent, it looks similar now.

The funny thing is that the landmark points seem to be where you would expect them; it's just that the confidence numbers don't seem to match what I see.

So I guess I'm back where I started: the confidence seems low, but now you are seeing the same. Am I expecting too much from the confidence?

May I know which frame the above log is from?

Aah, this is interesting.

It was frame 239, BTW, but I found something interesting in the code:

I print out the confidence in the secondary data probe at line 394 of
https://forums.developer.nvidia.com/uploads/short-url/ihGh6JXUl7NrrX8D9hQyod1EY1y.cpp

However, I now notice that I get multiple "softargmax:1" layers for each frame_number, and the confidence output is different for each layer (see the attached file; there are 20 instances of that layer for frame 239):
frame239.txt (44.0 KB)

Do you know why I would get multiple "softargmax:1" layers for each frame?

It also isn't consistent in how many of them I get:
frame 12: 3 instances
frame 13: 1 instance
frame 147: 5 instances

Can you run against the 239th frame directly to double-check?

I'll try, but I had a problem running the application against a .jpg input, so this may take a while.

When running the application, don't you see multiple softargmax:1 layers being processed per frame?