Using a Custom Action Recognition Model in DeepStream 3D Action Recognition

Hello,

Following is the complete information as applicable to my setup.

• Hardware Platform: Jetson Xavier AGX
• DeepStream Version: 6.0
• TensorRT Version: 8.0.1-1+cuda10.2
• Issue Type: questions
• Requirement details: Output shape is 0 while converting onnx file to engine for a custom action recognition model in Deepstream 3D action recognition.

My objective is to run a custom action recognition model with temporal batching in DeepStream 6.0. Since DeepStream 6.0 provides a C++-based sample app, deepstream-3d-action-recognition, with temporal batch support, I am trying to run my custom recognition model in that same sample app.

These are the input and output layer shapes of the custom ONNX model:

image1

When I use this same ONNX model in the DeepStream pipeline, it gets converted to a .engine file, but it then throws an error: from element primary-nvinference-engine: Failed to create NvDsInferContext instance

If you look at the input/output shapes of the converted engine below, one dimension has been squeezed out. The input shape of 1x3x16x224x224 becomes 3x16x224x224, and the output shape of 1 becomes 0.

image2
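For reference, the shape change can be reproduced with plain Python tuple slicing (an illustration only, assuming the leading batch dimension is being dropped as with implicit-batch engines): stripping the batch dimension leaves four dimensions for the input but zero dimensions for the 1-D output, which would match the 0 reported above.

```python
# Hypothetical illustration (not DeepStream's actual code) of how an
# implicit-batch engine strips the leading batch dimension from each binding.
input_shape = (1, 3, 16, 224, 224)   # [batch, channel, #frames, height, width]
output_shape = (1,)                  # a single value per batch item

engine_input_shape = input_shape[1:]    # (3, 16, 224, 224) -- 4 dims remain
engine_output_shape = output_shape[1:]  # () -- no dims remain, reported as 0
```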

Why could this be happening? Is this expected behaviour?

Please help me out with the same.

One workaround I have come across is to unsqueeze the output, adding another dimension to its shape, and then save the ONNX model. Thus, an output shape of [1] becomes [1,1]. Something like this:

image3

Following the above approach generates a .engine file with the shape below:

image4

This does not throw any errors, and the model executes, but it does not yield the correct output.
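Conceptually, the unsqueeze workaround just appends a trailing axis to the output. Here is a small NumPy sketch of the shape change (illustrative only — the actual edit is an equivalent unsqueeze op added to the ONNX graph):

```python
import numpy as np

# The model's raw output is a 1-D tensor with one value per batch item.
raw_output = np.array([1], dtype=np.int64)        # shape (1,)

# Unsqueezing appends a trailing axis so the shape becomes [batch, 1],
# matching the [batch, #class] layout DeepStream expects.
unsqueezed = np.expand_dims(raw_output, axis=-1)  # shape (1, 1)
```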

My questions are below:

1.) How can I use the custom model without unsqueezing an extra dimension onto its output? If DeepStream squeezes a dimension while loading the engine file, how can I use a model whose output has only a single dimension? Resolving this issue is the priority.
2.) Since I am unable to execute the custom model as-is (without unsqueezing the output), I am adding another dimension to the output to run it, which produces incorrect output results. Is this because of the unsqueeze operation I added to the output?

Would really appreciate resolutions to the issues listed above.

Do let me know if you need more clarity on this.

Thanks & Regards,
Hemang Jethava


Hi,

First, please note that there are newer JetPack 4.6.1 and DeepStream 6.0.1 releases.
It's always recommended to upgrade your environment to the latest release for the best experience.

For your model, would you mind telling us what your input is?
Do you concatenate 16 images that have 3 channels and a resolution of 224x224?

Since DeepStream is a camera pipeline, we will need to check whether this use case is supported.
Before that, would you mind running your model with trtexec to get the input/output shapes for us?

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --dumpOutput
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --dumpOutput
...
[03/25/2022-10:26:45] [I] Engine built in 7.93155 sec.
[03/25/2022-10:26:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1363, GPU 7256 (MiB)
[03/25/2022-10:26:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1363, GPU 7256 (MiB)
[03/25/2022-10:26:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[03/25/2022-10:26:45] [I] Using random values for input Input3
[03/25/2022-10:26:45] [I] Created input binding for Input3 with dimensions 1x1x28x28
[03/25/2022-10:26:45] [I] Using random values for output Plus214_Output_0
[03/25/2022-10:26:45] [I] Created output binding for Plus214_Output_0 with dimensions 1x10
[03/25/2022-10:26:45] [I] Starting inference
[03/25/2022-10:26:48] [I] Warmup completed 2079 queries over 200 ms
[03/25/2022-10:26:48] [I] Timing trace has 32153 queries over 3.00021 s
...

Thanks.

Hi @AastaLLL

The input of the model is 1x3x16x224x224. Yes, your understanding is correct: we concatenate 16 images with 3 channels at a resolution of 224x224. This is similar to the input of the 3D model in the deepstream-3d-action-recognition sample app.

I have run the model with trtexec to get the input/output shapes, which you can find below:

$ /usr/src/tensorrt/bin/trtexec --onnx=bs1_march22_fd.onnx --dumpOutput
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=bs1_march22_fd.onnx --dumpOutput
...
[03/25/2022-11:45:07] [I] Engine built in 29.1877 sec.
[03/25/2022-11:45:07] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1362 MiB, GPU 24932 MiB
[03/25/2022-11:45:07] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +7, now: CPU 1362, GPU 24939 (MiB)
[03/25/2022-11:45:07] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1362, GPU 24949 (MiB)
[03/25/2022-11:45:07] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1364 MiB, GPU 25060 MiB
[03/25/2022-11:45:07] [I] Created input binding for input with dimensions 1x3x16x224x224
[03/25/2022-11:45:07] [I] Created output binding for output with dimensions 1
[03/25/2022-11:45:07] [I] Starting inference
[03/25/2022-11:45:11] [I] Warmup completed 3 queries over 200 ms
[03/25/2022-11:45:11] [I] Timing trace has 37 queries over 3.20196 s
[03/25/2022-11:45:11] [I] 
[03/25/2022-11:45:11] [I] === Trace details ===
[03/25/2022-11:45:11] [I] Trace averages of 10 runs:
[03/25/2022-11:45:11] [I] Average on 10 runs - GPU latency: 86.5952 ms - Host latency: 86.9179 ms (end to end 87.02 ms, enqueue 35.0739 ms)
[03/25/2022-11:45:11] [I] Average on 10 runs - GPU latency: 84.4766 ms - Host latency: 84.802 ms (end to end 84.8117 ms, enqueue 34.946 ms)
[03/25/2022-11:45:11] [I] Average on 10 runs - GPU latency: 85.9088 ms - Host latency: 86.2366 ms (end to end 86.2853 ms, enqueue 35.4646 ms)
[03/25/2022-11:45:11] [I] 
[03/25/2022-11:45:11] [I] === Performance summary ===
[03/25/2022-11:45:11] [I] Throughput: 11.5554 qps
[03/25/2022-11:45:11] [I] Latency: min = 80.5767 ms, max = 91.0549 ms, mean = 86.4912 ms, median = 88.2876 ms, percentile(99%) = 91.0549 ms
[03/25/2022-11:45:11] [I] End-to-End Host Latency: min = 80.5886 ms, max = 91.0654 ms, mean = 86.5394 ms, median = 88.3589 ms, percentile(99%) = 91.0654 ms
[03/25/2022-11:45:11] [I] Enqueue Time: min = 19.5367 ms, max = 52.9646 ms, mean = 35.77 ms, median = 44.0358 ms, percentile(99%) = 52.9646 ms
[03/25/2022-11:45:11] [I] H2D Latency: min = 0.260986 ms, max = 0.385742 ms, mean = 0.325974 ms, median = 0.374298 ms, percentile(99%) = 0.385742 ms
[03/25/2022-11:45:11] [I] GPU Compute Time: min = 80.1975 ms, max = 90.7883 ms, mean = 86.164 ms, median = 88.0197 ms, percentile(99%) = 90.7883 ms
[03/25/2022-11:45:11] [I] D2H Latency: min = 0.000976562 ms, max = 0.0032959 ms, mean = 0.00118194 ms, median = 0.00109863 ms, percentile(99%) = 0.0032959 ms
[03/25/2022-11:45:11] [I] Total Host Walltime: 3.20196 s
[03/25/2022-11:45:11] [I] Total GPU Compute Time: 3.18807 s
[03/25/2022-11:45:11] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/25/2022-11:45:11] [I] 
[03/25/2022-11:45:11] [I] Output Tensors:
[03/25/2022-11:45:11] [I] output: (1)
[03/25/2022-11:45:11] [I] 0
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=bs1_march22_fd.onnx --dumpOutput
[03/25/2022-11:45:11] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1362, GPU 24981 (MiB)

As you can see, the input/output shapes are:

Created input binding for input with dimensions 1x3x16x224x224
Created output binding for output with dimensions 1

trtexec reports the correct I/O shapes, unlike the .engine shapes seen in the DeepStream pipeline.

I hope this gives you a better sense of the issue.

How do i resolve this?

In line with the above approach, I also saved an engine using trtexec, which has the correct input/output shapes. However, when I load that same engine file in the DeepStream pipeline, it gives the same error as earlier.

$ sudo deepstream-3d-action-recognition -c deepstream_action_recognition_config.txt 
num-sources = 1
Now playing: file:///test.mp4,

Using winsys: x11 
0:00:03.249616116 16276   0x558fd2a810 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1900> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-6.0/sources/apps/sample_apps/h_work_deepstream-3d-action-recognition/try.engine
INFO: [Implicit Engine Info]: layers num: 2
0   INPUT  kFLOAT input           3x16x224x224    
1   OUTPUT kINT32 output          0               

0:00:03.249795388 16276   0x558fd2a810 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2004> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-6.0/sources/apps/sample_apps/h_work_deepstream-3d-action-recognition/try.engine
0:00:03.253819158 16276   0x558fd2a810 ERROR                nvinfer gstnvinfer.cpp:632:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::allocateBuffers() <nvdsinfer_context_impl.cpp:1430> [UID = 1]: Failed to allocate cuda output buffer during context initialization
0:00:03.253901626 16276   0x558fd2a810 ERROR                nvinfer gstnvinfer.cpp:632:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1280> [UID = 1]: Failed to allocate buffers
0:00:03.268127274 16276   0x558fd2a810 WARN                 nvinfer gstnvinfer.cpp:841:gst_nvinfer_start:<primary-nvinference-engine> error: Failed to create NvDsInferContext instance
0:00:03.268186509 16276   0x558fd2a810 WARN                 nvinfer gstnvinfer.cpp:841:gst_nvinfer_start:<primary-nvinference-engine> error: Config file path: config_infer_primary_3d_action.txt, NvDsInfer Error: NVDSINFER_CUDA_ERROR
Running...
ERROR from element primary-nvinference-engine: Failed to create NvDsInferContext instance
Error details: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(841): gst_nvinfer_start (): /GstPipeline:preprocess-test-pipeline/GstNvInfer:primary-nvinference-engine:
Config file path: config_infer_primary_3d_action.txt, NvDsInfer Error: NVDSINFER_CUDA_ERROR
Returned, stopping playback
Deleting pipeline

Please help me out with the same.

Thanks & Regards,
Hemang Jethava

Hi,

We want to check this further.
Would you mind sharing the model and source with us so we can reproduce the issue?

Thanks.

Hi,

We tried to compare your model with our Action Recognition Net from NGC, but found an issue with your model. Would you mind sharing more information with us?

The question is related to the input/output dimension.
For a standard 3D model:

Expected
Input: [batch, channel, #frames, height, width]
Output: [batch, #class]

NGC’s model

[04/08/2022-05:39:08] [I] Created input binding for input_rgb with dimensions 1x3x32x224x224
[04/08/2022-05:39:08] [I] Using random values for output fc_pred
[04/08/2022-05:39:08] [I] Created output binding for fc_pred with dimensions 1x5

Yours

[04/08/2022-05:40:37] [I] Created input binding for input with dimensions 1x3x16x224x224
[04/08/2022-05:40:37] [I] Using random values for output output
[04/08/2022-05:40:37] [I] Created output binding for output with dimensions 1

It seems that your model's output has shrunk to a single dimension,
but it should be 1x1 to match the [batch, #class] format.

Is the batch size of your model fixed to 1?

Thanks.

Hi @AastaLLL

As of now, we are keeping the batch size fixed at 1. The 3D model being used generates an output with a single dimension only.

You can refer to a similar example of such models here.

Let me know if you need further details.

Thanks,
Hemang


Hi @AastaLLL

Just to add, this custom action recognition model returns 0 or 1 (False or True), not a #class score vector.

Hence, even if we set the batch size > 1 (e.g. 10), it would simply return a 1-D output with the shape of the batch size (e.g. [10]).

Thus, unlike a standard 3D model, this custom 3D model returns True or False rather than #class scores. We index into the output to extract the batch-wise results.

For example, if the input shape is 10x3x16x224x224, then the output shape is just 10.
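A small NumPy sketch of what I mean by batch-wise indexing (illustrative only, not our actual pipeline code):

```python
import numpy as np

# With batch size 10, the model returns one 0/1 flag per clip.
batch_size = 10
outputs = np.zeros(batch_size, dtype=np.int64)  # shape (10,)
outputs[3] = 1  # e.g. the activity was detected in the 4th clip

# Batch-wise extraction is plain indexing into the 1-D result.
results = [bool(outputs[i]) for i in range(batch_size)]
```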

Thanks,
Hemang


Hi,

Thanks for your update.

The problem is that DeepStream expects a [batch, 1] output dimension,
but in the ONNX model it is [batch], meaning the second dimension is omitted.
We are working on how to handle this use case.

For your INT64 model, have you verified the accuracy with PyTorch or ONNXRuntime before?

Thanks.

Hi

The original PyTorch model has a Boolean-datatype output, and we changed it to an int64-datatype output for the ONNX model to run in the DeepStream pipeline, since a Boolean output datatype is not supported.

For the INT64 model, we haven't verified the accuracy of the squeezed version (the model with the [1] output shape). But for the unsqueezed INT64 model (the one with the [[1]] output and (1x1) shape), we have checked the accuracy with PyTorch: it yields the same outputs as the Boolean-output model executed in the PyTorch pipeline. This was a comparison of the same custom model with different output shapes and dtypes in the PyTorch pipeline.
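For reference, the datatype change is a plain cast that preserves the 0/1 values; a NumPy sketch of the idea (our actual conversion happens on the PyTorch side before ONNX export):

```python
import numpy as np

# Boolean per-clip decisions from the original model...
bool_out = np.array([True, False], dtype=np.bool_)

# ...cast to int64 for the ONNX model, since DeepStream does not
# support a Boolean output layer. True -> 1, False -> 0.
int_out = bool_out.astype(np.int64)
```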

I hope this is what you were looking for.

Thanks,
Hemang

Hi,

We wrote a patch to handle this omitted-output-dimension use case.
Since the output is declared as [batch, None] in the ONNX model, we hardcode the output to [batch, 1] when only one dimension is given.

With the patch, we can run your INT64 model successfully with deepstream-3d-action-recognition.
Since we only have a model with random weights, please help verify the accuracy.

nvdsinfer.patch (613 Bytes)

$ cd /opt/nvidia/deepstream/deepstream-6.0/sources/libs/nvdsinfer
$ git apply nvdsinfer.patch
$ CUDA_VER=10.2 && sudo CUDA_VER=10.2 make install

Thanks.

Hi @AastaLLL

This patch is working, and we are able to execute the INT64 model without unsqueezing the output. However, we are not able to get the correct output from the execution.

More details about the model:
If you inspect the INT64 model architecture with Netron, you'll see that it is an ensemble of a 3D model and a 2D model. We multiply the outputs of both models and return that as the final output. Thus, a specific activity is classified as 1 only when both models classify it as 1, and in that case we get 1 (True) as the output of the ensemble model.

My observation is that when we do the multiplication and return a single output, the model does not yield the correct result; it always returns 0 (False).
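To make the ensemble logic concrete, the multiplication acts as an element-wise logical AND of the two binary outputs (an illustrative sketch, not the actual model code):

```python
import numpy as np

# Per-clip 0/1 decisions from the two branches of the ensemble.
out_3d = np.array([1, 1, 0], dtype=np.int64)
out_2d = np.array([1, 0, 0], dtype=np.int64)

# Element-wise product: 1 only where BOTH branches predict 1,
# i.e. a logical AND. If either branch always predicts 0, the
# ensemble output is stuck at 0 (False).
final = out_3d * out_2d
```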

Let me know your thoughts on the same.

Thanks,
Hemang

Hi,

It sounds like this issue is from TensorRT directly.

Would you mind helping us check whether your model runs correctly with ONNXRuntime?
For example, with this inference script: ort_inference.py (645 Bytes)

If yes, it’s easier for us to debug by comparing the layer output between ONNXRuntime and TensorRT.
Thanks.

Hi @AastaLLL

Yes. Somehow I was unable to get the CUDAExecutionProvider properly installed, even after installing all the dependencies, so I assume it fell back to the default CPUExecutionProvider.

Below is the output of the same.

$ python3 ort_inference.py
2022-04-18 07:05:10.583973359 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:552 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Input: input, size=['temporal_batch_size', 3, 16, 224, 224]
Output: output, size=['Muloutput_dim_0']
Input shape after concatenating two test_x inputs: (2, 3, 16, 224, 224)
Output: [0 0]

I hope this helps.

Thanks,
Hemang

Hi,

Did you test this on the trained model?
Also, would you mind testing the bool version of the model with real input?

We tested random input (values ranging from 0 to 255) on the INT64 and bool models.
Both always return True/1.

We are not sure if there is an issue when converting the model to ONNX format.
Is it possible to share the PyTorch model and PyTorch inference script with us for debugging?

Thanks.

Hi @AastaLLL

Yes, we have tested this on the trained model, but with NumPy zeros. I will try with the original set of images.

I haven't tried the bool version of the model. Does the patch file you provided support the boolean datatype as well? I will check from my end too.

I will cross-check this as well. We are currently exporting the same model architecture with weights in both TorchScript and ONNX formats, and the TorchScript model seems to work fine.

Let me get back to you on this.

Thanks for the support.
Hemang

Hi,

The ort_inference.py script uses ONNXRuntime, which supports the bool output type.
Thanks.

Hi,

Okay, we have validated that the bool-output model executes with PyTorch (actual pipeline) and ONNXRuntime (dummy data), but not in the DeepStream pipeline, where we get an "unknown datatype for output layer" error. Just FYI.

Thanks.

Hi,

Do you get the expected output with ONNX inference on the boolean model?

If yes, this indicates that inference with ONNXRuntime + the bool model is correct,
so we can use that model as ground truth to compare against the TensorRT/DeepStream results.

Thanks.

Hi @AastaLLL

I have tested ONNX and PyTorch inference with dummy data only. For ONNX inference on the original data, I have to make some changes to my PyTorch pipeline first.

I am planning to use the PyTorch inference output as ground truth to compare against the TRT/DS results.

I have another priority task right now; I will update you with the outcome of the above experiment once I conclude it, in a couple of days.

Thanks,
Hemang

Thanks for the update.
It also helps us to have a PyTorch ground truth to compare with.