Is there anything that needs extra attention when using my own engine file?

Here is my config:

• Hardware Platform (Jetson / GPU): 2080Ti or TITAN V
• DeepStream Version: 5.0
• TensorRT Version: 7.0.0-1+cuda10.2
• NVIDIA GPU Driver Version: 440.33.01

Recently I tried to build a face recognition detector based on the sample program deepstream-test5. The model of the detector comes from RetinaFace. I built the corresponding .engine file according to the project’s README, wrote and compiled the .so file containing the “parse-bbox-func” function according to the DS SDK, and modified the config.txt file.

But during execution, I found that my program produced almost no valid detections. I looked at the output and found that the model’s confidence for almost all anchor predictions is 0.5. After checking the entire pipeline, I found that the output of the model before the last Softmax layer is almost all 0, as if my model parameters were hardly loaded at all.

My configuration is basically the same as SampleSSD, and the corresponding decoder and parser are implemented. But when I set my engine file and config file to the SSD model, it can run normally.

Did I overlook any additional details that need attention? I noticed that the SSD model uses .uff, while my model uses a single engine file. Does this make any difference?

Here is part of my config file:

test5_config_file_src_infer.txt

[primary-gie]
enable=1
gpu-id=0
batch-size=1
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;1;1;1
bbox-border-color3=0;1;0;1
nvbuf-memory-type=0
interval=0
gie-unique-id=1
model-engine-file=/shared/tensorrtx/retinaface/build/retina_r50_CUDA0.engine
labelfile-path=labels.txt
config-file=config_infer_primary.txt

config_infer_primary.txt

[property]
gpu-id=0
net-scale-factor=1
offsets=117.0;104.0;123.0
model-color-format=1
model-engine-file=/shared/tensorrtx/retinaface/build/retina_r50_CUDA0.engine
labelfile-path=labels.txt
batch-size=1
network-mode=2
num-detected-classes=0
network-type=0
interval=0
gie-unique-id=1
is-classifier=0
parse-bbox-func-name=NvDsInferParseCustomRFace
custom-lib-path=…/nvdsinfer_custom_impl_RetinaFace/libnvdsinfer_custom_impl_RetinaFace.so

THX!!!

Hi,

Would you mind running your model with TensorRT directly first?

$ /usr/src/tensorrt/bin/trtexec [info] 

This will help us find out whether the issue comes from TensorRT inference or from the DeepStream SDK.
Thanks.

Hi AastaLLL

Thanks for your reply!

In the past two days, I have checked my code in detail to confirm that my model is working properly. Finally I found the problem:

Around lines 287~292 of nvdsinfer/nvdsinfer_backend.cpp, DS runs the model inference:

if (!m_Context->enqueue(batchDims.batchSize, bindingBuffers.data(), stream,
    (consumeEvent ? &consumeEvent->ptr() : nullptr)))
{
    dsInferError("Failed to enqueue inference batch");
    return NVDSINFER_TENSORRT_ERROR;
}

I followed the code used by the model library to modify the part as follows:

cudaStream_t _stream;
CHECK(cudaStreamCreate(&_stream));
if (!m_Context->enqueue(batchDims.batchSize, bindingBuffers.data(), _stream, nullptr))
{
    dsInferError("Failed to enqueue inference batch");
    return NVDSINFER_TENSORRT_ERROR;
}

DS can finally execute my model correctly.

It seems that this is due to the stream parameter. DS uses a globally created stream variable m_InferStream, and my model creates a new empty cudaStream_t each time it is executed. How does this variable affect the work of the model itself? If I want to make my model work without modifying the nvdsinfer library, what kind of modifications do I need to make?

Thank you for your attention and answer!

Can anyone help me? TvT

Hi,

We want to look into this issue more deeply and reproduce it in our environment.
Would you mind sharing the following information with us?

  1. libnvdsinfer_custom_impl_RetinaFace.so
  2. test5_config_file_src_infer.txt (please skip this if the content is identical to the above)
  3. config_infer_primary.txt (please skip this if the content is identical to the above)
  4. Steps to compile retina_r50_CUDA0.engine (since engine file is device-dependent)

Thanks.

Hi, AastaLLL!

libnvdsinfer_custom_impl_RetinaFace.zip (31.0 KB)

Here are my libnvdsinfer_custom_impl_RetinaFace.so, test5_config_file_src_infer.txt, config_infer_primary.txt and labels.txt.


As for compiling retina_r50_CUDA0.engine, I mainly followed the README of the RetinaFace project. The specific steps are as follows:

  1. generate retinaface.wts
git clone https://github.com/wang-xinyu/Pytorch_Retinaface.git

Then download its weights Resnet50_Final.pth from Google Drive and put it in Pytorch_Retinaface/weights.

cd Pytorch_Retinaface
python detect.py --save_model # python3 with Pytorch
python genwts.py # python3 with Pytorch

Then a file ‘retinaface.wts’ will be generated.

  2. generate retina_r50.engine
cd ..
git clone https://github.com/wang-xinyu/tensorrtx.git
cp Pytorch_Retinaface/retinaface.wts tensorrtx/retinaface/retinaface.wts
cd tensorrtx/retinaface

In order for this model to be applied to DS, the source code needs to be slightly modified. In retina_r50.cpp, line 26, it needs to be modified to:

#define BATCH_SIZE 4

Then build:

mkdir build
cd build
cmake ..
make
./retina_r50 -s

Then the model retina_r50.engine will be generated.


Note that because the model is anchor-based, the bug stated here prevents the network from using NMS to filter out any results, which makes the pipeline too slow. For debugging, you can set num-detected-classes in the [property] block of config_infer_primary.txt to 0 (line 58).

Thanks for your attention!!!

Hi,

Do you have the source of libnvdsinfer_custom_impl_RetinaFace.so?
We tried to reproduce this issue but found that libnvdsinfer_custom_impl_RetinaFace.so is not compatible with our environment.

Would you mind sharing the source so we can build it directly?

Thanks.

Hi, AastaLLL!

code.zip (6.9 KB)

Here is the code of my project. You may need to modify CUDA_VER?=10.2 in the Makefile to fit your platform.

Thanks for your attention!!!

Hi, AastaLLL!

After some re-testing, I found a better way to make my model work properly:

Around line 1092 of nvdsinfer/nvdsinfer_backend_impl.cpp, the stream used by DS to execute the inference process is initialized:

m_InferStream = std::make_unique<CudaStream>(cudaStreamNonBlocking);

I modified the initialization flag:

m_InferStream = std::make_unique<CudaStream>(cudaStreamDefault);

Surprisingly, my model can run normally. After consulting the CUDA documentation, I found that the difference between these two flags is related to stream synchronization: a stream created with cudaStreamNonBlocking does not synchronize implicitly with the default stream, while one created with cudaStreamDefault does. Is this the main reason for my problem?

I hope this experiment helps you find the problem. If possible, I would like to make my model run normally by modifying the code that generates the model file instead of modifying the code of the nvdsinfer plug-in itself.

Thanks for your attention!!!

Hi,

Thanks for your update.
We tried the sample in our environment today, but somehow DeepStream doesn’t work.
The display is entirely black without any result.

Have you also run into this issue before?
After replacing the engine file and bbox parser with those from the YOLOv3 sample, the pipeline works correctly.

Thanks.

Hi, AastaLLL!

Since I am using a docker environment, I have not tried direct screen output. In the previous experiment, I used the [sink2] (file output) to check whether the result is correct. I did encounter the problem that the output video file could not be played before, but at that time I paid more attention to the inference bug stated above instead of carefully checking this issue.

Is the problem caused by the slow processing speed? Due to the bug stated above, my model outputs 60000+ detection results for each frame, and they cannot be eliminated by NMS (because no confidence regression is made), which causes the entire pipeline to get stuck.

Thanks for your attention!!!

Hi,

Actually, we just thought there was some issue with our reproduction, since the display was non-functional.

We checked the inference result directly via the nvdsinfer_custom_impl_RetinaFace parser
and found that the output is good with the default cudaStreamNonBlocking setting.

...
Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
** INFO: <bus_callback:167>: Pipeline running

bounding box = 60701
0: 1340, 84, 16, 16
1000: 228, 4, 16, 16
2000: 972, 12, 16, 16
3000: 740, 132, 16, 16
4000: 1292, 108.003, 16, 16
5000: 444, 92, 32, 32
6000: 0, 172, 12, 16
7000: 876, 180, 32, 32
8000: 1524, 204, 16, 16
9000: 828, 132, 31.9999, 32
10000: 916, 0, 31.9999, 20
11000: 428, 228, 32, 32.0002
12000: 44.0007, 244, 16, 16.0002
13000: 708, 68, 32, 32
14000: 516, 260, 32, 32
15000: 388, 292, 16, 16
16000: 1452, 268, 32, 32
17000: 1004, 324, 32, 32
18000: 644, 364, 16, 16
19000: 308, 436, 16, 16
20000: 756, 500, 16, 15.9999
21000: 1180, 516, 16, 16
22000: 1036, 540, 16, 16
23000: 1276, 580, 16, 16
24000: 564, 588, 16, 16
25000: 980, 428, 32, 32
26000: 1572, 660, 16, 16
27000: 284, 444, 32, 32
28000: 1188, 612, 32, 32.0001
29000: 924.112, 628, 32, 32
30000: 1412, 372, 32, 32
31000: 1539.99, 620, 32.0137, 32
32000: 1036, 684, 16, 16
33000: 252, 580, 32, 32
34000: 1444, 692, 16, 15.9999
35000: 908.028, 700, 15.9999, 16
36000: 532, 748, 16, 16
37000: 20, 772, 16, 16
38000: 36, 812, 16, 16
39000: 996, 828.001, 16, 16
40000: 1572, 892, 16, 16
41000: 1252, 796, 32, 32.0001
42000: 1476, 724, 32, 32
43000: 452, 828, 32, 32
44000: 1260, 804, 32, 32
45000: 20, 908, 32, 19
46000: 1396.01, 820, 31.9999, 32
47000: 56.1423, 120, 64, 64
48000: 232, 294.265, 64, 67.4707
49000: 792.003, 87.9652, 128, 128.07
50000: 647.992, 216, 128.015, 128
51000: 104, 375.999, 64, 64.0015
52000: 775.999, 375.995, 128.002, 128.009
53000: 1016, 536, 64, 63.9999
54000: 0, 676.03, 40.001, 72.0261
55000: 328.108, 504, 128, 128
56000: 120, 792, 128, 128.001
57000: 936, 840, 64.0001, 63.9999
58000: 444.416, 336.261, 295.168, 256
59000: 151.812, 0, 256, 176.017
60000: 0, 560, 272, 367
...

Did we miss anything?
If the issue persists, would you share the docker image you are using with us?

Thanks.

Hi, AastaLLL!

What you showed is actually the wrong result.

RetinaFace is a face detection algorithm based on a feature pyramid with key point regression. For the preset anchors at each position of the feature map, it outputs bounding box regression, confidence and face key point information. The output for each preset box is 15 values long, in the format:

[2 location coordinates] [2 scale coordinates] Confidence [10 key points results]
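
As a rough illustration of that 15-value layout, here is a minimal C++ sketch. The struct name FaceDetection, its field names and the helper detection_at are hypothetical, made up for this example and not taken from the plugin source; the only thing assumed from the description above is the order and count of the values.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical layout for one anchor's output, in the order described above:
// 2 location values, 2 scale values, 1 confidence, 10 key point values.
struct FaceDetection {
    float location[2];    // box position
    float scale[2];       // box size
    float confidence;     // face score
    float landmarks[10];  // 5 key points, stored as 10 floats
};

// Number of floats one detection occupies in a flat float buffer.
constexpr std::size_t kFloatsPerDetection = sizeof(FaceDetection) / sizeof(float);
static_assert(kFloatsPerDetection == 15, "one detection should span 15 floats");

// Copy the i-th 15-float record out of a flat output buffer.
inline FaceDetection detection_at(const float* buffer, std::size_t i) {
    FaceDetection d;
    std::memcpy(&d, buffer + i * kFloatsPerDetection, sizeof(d));
    return d;
}
```

With such a layout, the confidence is always the fifth float of a record, which is the value to print when checking whether the network produced real scores.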

Under normal test results, for most points, the original value of the confidence parameter should be similar to:


conf1: 4.210938; conf2: -4.187500

conf1&2 correspond to negative and positive examples, and after passing the SoftMax function, the final confidence will be much less than 0.02, and these results will be filtered out (see decode.cu, function CalDetection, Line 25~29):

float conf1 = cls_reg[idx + k * total_grid * 2];
float conf2 = cls_reg[idx + k * total_grid * 2 + total_grid];
// printf("conf1: %f; conf2: %f\n", conf1, conf2);
conf2 = expf(conf2) / (expf(conf1) + expf(conf2));
if (conf2 <= 0.02) continue;

Therefore, in pictures with not too many faces (or only one or two faces), the enqueue function will finally output dozens of candidate boxes, which are then further merged by algorithms such as NMS.

In the results you show, the enqueue function outputs 60701 candidate results (which means that almost no candidate box is screened out), and the regression coordinates of each result are just the preset anchor values at that position, that is, no regression has been applied. If you add code to print the output of the enqueue function, either in the source of the nvdsinfer plug-in or in decode.cu, you will find that the fifth parameter, the confidence, is almost always 0.5 (because conf1 and conf2 are both 0 at this point).
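
The two confidence cases can be checked numerically with a small host-side C++ sketch. The function names face_confidence and passes_filter are hypothetical helpers written for this example; the arithmetic follows the softmax expression and the 0.02 threshold quoted from decode.cu above.

```cpp
#include <cmath>

// Two-class softmax for the "face" logit, written like the quoted expression:
// conf2 = expf(conf2) / (expf(conf1) + expf(conf2)).
inline float face_confidence(float conf1, float conf2) {
    return std::exp(conf2) / (std::exp(conf1) + std::exp(conf2));
}

// Mirrors the quoted filter: anchors at or below the threshold are skipped.
inline bool passes_filter(float confidence, float threshold = 0.02f) {
    return confidence > threshold;
}
```

With the healthy logits quoted earlier (conf1: 4.210938, conf2: -4.187500), face_confidence returns a value far below 0.02, so the anchor is dropped. With both logits at 0 it returns exactly 0.5, so every anchor passes the filter, which is consistent with the 60701 boxes in the log.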

I showed the intermediate results of this part in this question before, but unfortunately I did not get a satisfactory answer. I hope that information can give you some new ideas.

Alternatively, you can use DS’s built-in detection engine /samples/models/Primary_Detector/resnet10.caffemodel_b4_gpu0_int8.engine (the default engine in the sample config), or use cudaStreamDefault as I described, and compare the output to see why I say the results you are showing right now are wrong.

Thanks for your attention!!!

Hi,

Thanks for your explanation.
We can now reproduce this issue and are checking it with our internal team.
We will update you once we get feedback.

Thanks.

Hi,

We got some feedback from our internal team.

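It looks like the decode kernel is not attached to the same CUDA stream as the rest of the inference (it is launched without a stream argument, so it runs on the default stream), which leads to this issue.
After applying the patch below, we can see that the bounding boxes are generated normally.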

diff --git a/nvdsinfer_custom_impl_RetinaFace/decode.cu b/nvdsinfer_custom_impl_RetinaFace/decode.cu
index b5c0f48..2d6e385 100644
--- a/nvdsinfer_custom_impl_RetinaFace/decode.cu
+++ b/nvdsinfer_custom_impl_RetinaFace/decode.cu
@@ -73,15 +73,15 @@ void DecodePlugin::forwardGpu(const float *const * inputs, float * output, cudaS
     totalCount += decodeplugin::INPUT_H / 16 * decodeplugin::INPUT_W / 16 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);
     totalCount += decodeplugin::INPUT_H / 32 * decodeplugin::INPUT_W / 32 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);
     for(int idx = 0 ; idx < batchSize; ++idx) {
-        cudaMemset(output + idx * totalCount, 0, sizeof(float));
+        cudaMemsetAsync(output + idx * totalCount, 0, sizeof(float), stream);
     }
 
     for (unsigned int i = 0; i < 3; ++i)
     {
         num_elem = batchSize * decodeplugin::INPUT_H / base_step * decodeplugin::INPUT_W / base_step;
         thread_count = (num_elem < thread_count_) ? num_elem : thread_count_;
-        CalDetection<<< (num_elem + thread_count - 1) / thread_count, thread_count>>>
-            (inputs[i], output, num_elem, base_step, base_anchor, totalCount);
+        CalDetection<<< (num_elem + thread_count - 1) / thread_count, thread_count, 0, stream>>>
+              (inputs[i], output, num_elem, base_step, base_anchor, totalCount);
         base_step *= 2;
         base_anchor *= 4;
     }

Thanks.

Hi, AastaLLL!

I’m sorry I didn’t check the information in time; I was on a business trip last week.

After confirmation, I believe this is the answer I was looking for. Thank you very much for your patience and answers throughout this process. Best wishes to you!