Parsing output issue

My network has a customized plugin layer.
The plugin layer's output is packed into one buffer according to batch size.
For batch size 4, the plugin's output buffer is arranged as 20 × 4 values:
| 0…result0…19 | 0…result1…19 | 0…result2…19 | 0…result3…19 |

The plugin layer is the output of the network, so I need a custom parser function.
But the custom parsing function

extern "C"
bool NvDsInferClassiferParseCustomSoftmax (std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
        NvDsInferNetworkInfo  const &networkInfo,
        float classifierThreshold,
        std::vector<NvDsInferAttribute> &attrList,
        std::string &descString)
{

}

is called 4 times, each time with output data of size 20 in outputLayersInfo.

My plugin works correctly in TensorRT and produces correct results.
In DeepStream, I don't get correct output.

How is my plugin layer's output split into four vectors? How can I get correct results?

I am using AGX Xavier, JetPack 4.4, DeepStream 5.0, TensorRT 7.1, CUDA 10.2.

Hi,

Is this topic a duplicate of Parsing output with custom parser has error in decoding?

First, are you able to feed the same input into both TensorRT and DeepStream?
If yes, would you mind sharing the output of both with us first?
We want to check whether the output is corrupted or just rearranged.

Thanks

Yes, they are duplicates. I would like to make this one clearer.

This is the TensorRT input for batch size 10.

Dims NumPlateRecognition::loadJPEGFile(std::vector<std::string> fileName, int num)
{
    Dims4 inputDims{num, 24, 94, 3};
    Dims4 inputDims_1img{1, 24, 94, 3};
    const size_t vol = samplesCommon::volume(inputDims);
    const size_t vol_1img = samplesCommon::volume(inputDims_1img);
    unsigned char *data = new unsigned char[vol];
    // Pack all images back-to-back into one NHWC byte buffer.
    for (int f = 0; f < num; f++) {
        cv::Mat image = cv::imread(fileName[f], cv::IMREAD_COLOR);
        cv::Mat im_rgb;
        cv::cvtColor(image, im_rgb, cv::COLOR_BGR2RGB);
        memcpy(data + (f * vol_1img), im_rgb.ptr<unsigned char>(), vol_1img);
    }
    // Normalize the whole batch to [0, 1] floats once, after packing.
    mInput.hostBuffer.resize(inputDims);
    float *hostDataBuffer = static_cast<float *>(mInput.hostBuffer.data());
    std::transform(data, data + vol, hostDataBuffer,
                   [](unsigned char x) { return static_cast<float>(x) / 255.0f; });
    delete[] data;
    return inputDims;
}

The output from the last customized plugin layer, CTCGreedyDecoder, is

26 20 12 8 5 3 1 33 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 14 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 7 5 6 0 27 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 16 11 14 8 5 5 3 27 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 30 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 17 13 1 1 2 9 25 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 19 26 2 8 3 8 16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 19 27 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 13 26 7 7 5 1 32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 15 10 9 3 14 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

That is a 20 × 10 buffer. The output parser in the TensorRT sample is quite straightforward: walk through the data in groups of 20 and decode each group to characters. It worked correctly with good accuracy in TensorRT.

For DeepStream:
net-scale-factor=0.0039215697906911373, which is 1/255.
Inside nvdsinfer_context_impl.cpp:

NvDsInferStatus InferPreprocessor::transform(NvDsInferContextBatchInput& batchInput, void* devBuf, CudaStream& mainStream, CudaEvent* waitingEvent)
{
     convertFcnFloat(outPtr, (float *)batchInput.inputFrames[i],
                m_NetworkInfo.width, m_NetworkInfo.height, batchInput.inputPitch,
                m_Scale, m_MeanDataBuffer.get() ? m_MeanDataBuffer->ptr<float>() : nullptr,
                *m_PreProcessStream);
}

I assume this is the same as multiplying by 1/255, since I don't have mean data.

model-color-format=0 (RGB format, same as in TensorRT)
infer-dims=24;94;3 (same as in TensorRT)

The whole config file is as follows.

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
onnx-file=../../../../samples/models/platerect/numplate_recg_nhwc_removed_sparsetodense.onnx
model-engine-file=../../../../samples/models/platerect/numplate_recg_nhwc_removed_sparsetodense.onnx_batch_max10_gpu0_fp16.engine
#mean-file=../../../../samples/models/Secondary_CarColor/mean.ppm
#labelfile-path=../../../../samples/models/platerect/labels.txt
#int8-calib-file=../../../../samples/models/Secondary_CarColor/cal_trt.bin
infer-dims=24;94;3
force-implicit-batch-dim=0
batch-size=10
# 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
input-object-min-width=20
input-object-min-height=10
process-mode=2
model-color-format=0
gpu-id=0
gie-unique-id=2
operate-on-gie-id=1
operate-on-class-ids=1
network-type=1
parse-classifier-func-name=NvDsInferParseCustomCTCGreedy
custom-lib-path=/usr/src/tensorrt/CTCGreedyDecoder_Plugin/build/libCTCGreedyDecoder.so
output-blob-names=trest:0
classifier-threshold=0

Then the output at CTCGreedyDecoder (printed inside the CTCGreedyDecoder plugin) for batch size 3 is
26 20 26 26 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 11 26 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 -1 26 19 26 17 9 8 17 5 9 5 4 17 9 26 -1 -1 -1 -1 -1 -1
So there are 3 × 20 values in total in the output buffer.
The output already looks corrupted at CTCGreedyDecoder. The second output is always
26 11 11 11 11 11 11 11 11 11 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
even though the input image changes.

Also, the output at CTCGreedyDecoder and the data at the customized parser are totally different.

Output at CTCGreedyDecoder
26 20 26 26 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 11 26 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 -1 26 19 26 17 9 8 17 5 9 5 4 17 9 26 -1 -1 -1 -1 -1 -1

When the data is checked at the customized parser (the parser is called three times for batch size 3):

dims 20 (one image's data chopped from the batch of 3)
0 2 0 2 0 2 0 2 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1

dims 20 (one image's data chopped from the batch of 3)
0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1

dims 20 (one image's data chopped from the batch of 3)
0 2 0 2 0 2 0 2 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1

They are totally different.

Also, the data at the parser is always the same and never changes, even though the CTCGreedyDecoder plugin output changes for different input images.

Hi,

We guess that the input order is misused somewhere.
This may happen in some usages of DeepStream that force the dimension of the C channel.

We will try to reproduce this and update here later.
Thanks.

Thanks. May I know how long the test and feedback will take? I'm in the middle of development for the project demo. Thank you.


Do you have any updates?

Not yet, waiting for the test and reply from @AastaLLL

Hi, both

Will update here once we get any progress.
Thanks.

Hi,

Please correct us if we don't understand your problem correctly.
It seems that the output for batch id=0 is correct,
but the output for batch id=1 to id=2 is incorrect and repeated.

Based on this, a possible issue is that the batch is not correctly formatted.
So the issue may occur in the input source rather than in inference.

In general, we use multiple sources, and each source feeds into a specific batch location.
Instead of deepstream-test2, would you mind trying multi-source with deepstream-app directly?

Ex.

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=3
uri=file://../../streams/sample_1080p_h264.mp4
num-sources=12
#drop-frame-interval=2
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
cudadec-memtype=0

Thanks.

I have also tested with deepstream-app and had the same issue.
Yes, id 0 is OK and ids 1 and 2 are wrong.

Also, the custom parser doesn't get the same data as CTCGreedyDecoder's output.
These are the two issues.

How should I proceed to solve the issues?

If you think the input has an issue, what should I look at?
With the preprocessing below, am I handling it correctly with the configs shown in my original post?

NvDsInferStatus InferPreprocessor::transform(NvDsInferContextBatchInput& batchInput, void* devBuf, CudaStream& mainStream, CudaEvent* waitingEvent)
{
     convertFcnFloat(outPtr, (float *)batchInput.inputFrames[i],
                m_NetworkInfo.width, m_NetworkInfo.height, batchInput.inputPitch,
                m_Scale, m_MeanDataBuffer.get() ? m_MeanDataBuffer->ptr<float>() : nullptr,
                *m_PreProcessStream);
}

This issue is happening in the second GIE.

I am almost finished with my project demo, and my team agreed to use DeepStream for the project because I showed them it works in TensorRT.

I have two issues. First, the CTCGreedyDecoder output has wrong data.
You think the input data has some issue.
Where should I look to correct it?

The other is the custom parser: its data is totally different from the CTCGreedyDecoder output.
When I checked whether raw-output-generated-callback could give me the raw output from CTCGreedyDecoder, I got

Unknown or legacy key specified 'raw-output-generated-callback' for group [property]
Unknown or legacy key specified 'raw-output-file-write' for group [property]

What could be the issue?

Hi,

Would you mind trying the multi-source setup shared above?
That configuration feeds the batch with the same video and has been verified by our QA team.

This could help us figure out if the cause comes from input.

Thanks.

Do you mean deepstream-app? I have tried it; same issue. Do you want to see the outputs and config files?

Hi,

Yes, please.

Output from deepstream-app for batch size 10:
mInputDims from enqueue 88 10 48
mOutputDims from enqueue 10 20
before print
26 20 29 19 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 29 19 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 26 20 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 26 20 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 20 26 20 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 11 11 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 3 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 -1 -1 -1 -1 -1 -1 -1 -1 -1

These last five are wrong: 26 11 11 11 11 11 11 11 11 11 11 11 11 11 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 11 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 11 3 14 -1 -1 -1 -1 -1 -1 26 11 11 11 11 11 11 11 11 11 11 -1 -1 -1 -1 -1 -1 -1 -1 -1

The custom parser output is also wrong.

dims 20
decoded as 020202020
dims 20
decoded as 0
dims 20
decoded as 020202020
dims 20
decoded as 0
dims 20
decoded as 020202020
dims 20
decoded as 0
dims 20
decoded as 020202020
dims 20
decoded as 0
dims 20
decoded as 020202020
dims 20
decoded as 0

Config files are numplate.txt (5.2 KB) numplate_pgieconfig.txt (4.1 KB) numplate_sgieconfig.txt (3.7 KB)

May I know if there is any feedback on the issues?

Any updates on this? Should I go back to TensorRT?

Hi,

Sorry, we are still checking this.
We will update you with more information later.

Thanks.

Hi,

Sorry for keeping you waiting.

We found a possible cause of this problem.
It seems that the batch-size of streammux and primary-gie is set to 1 in numplate.txt.

These two components control the input data of the secondary classifier.
Would you mind updating both to 10 and trying again?

[streammux]
...
batch-size=10
...

[primary-gie]
...
batch-size=10
...

Thanks.