OpenVX/VisionWorks graph input from a GPU memory buffer

Description

Hello,

I am facing a problem when running SiamRPN++ inference with TensorRT. The problem is present for
many similar network architectures and is, from what I have learned, due to TensorRT not supporting cross-correlation between two dynamic inputs (TensorRT seems to require a static kernel for this kind of operation).

The solutions found so far are to crop the network architecture just before the cross-correlation operation and to perform that operation either manually or with another framework.

One functional solution was to couple TensorRT with a second inference engine (ONNX Runtime), which supports this operation. However, for unknown reasons, performance with ONNX Runtime was terrible (more than 500 ms for a single inference, versus around 60 ms on the PC-based version for the whole network; the hardware difference explains part of the gap, but the gap is not consistent with what was observed on ResNet-50 for comparison).

Another solution would be to reimplement the operation in CUDA, which seems particularly time consuming and not portable.
The last solution, which I am currently exploring, is to use OpenVX/VisionWorks and the vxMatchTemplate node to implement the mentioned operation.

In the scope of this last solution I am trying to bind a VisionWorks image to an already allocated GPU buffer (the output of TensorRT) but couldn't find how to do so.
It seems vxCreateImageFromHandle would be a good starting point, but the memory type parameter only offers the Host and None types, which do not seem to correspond to GPU memory (although the memory is physically shared on the TX2, I don't think the pointers are interchangeable this way).
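For reference, the workaround I am currently trying looks roughly like the following (a minimal, untested sketch; all names and sizes are illustrative). Since the TX2 memory is physically shared, the idea is to allocate the buffer as mapped (zero-copy) pinned memory, so the same allocation has a device pointer for CUDA/TensorRT and a host pointer that can be imported with VX_MEMORY_TYPE_HOST. I suspect the VisionWorks NVX extensions also expose a CUDA import type, but I could not verify it, so the sketch sticks to the standard OpenVX path:

#include <cuda_runtime.h>
#include <VX/vx.h>

int main()
{
    const vx_uint32 width = 31, height = 31;   // one 31x31 feature plane, as an example

    // Must be set before the CUDA context is created so that mapped allocations work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped (zero-copy) allocation: one physical buffer, two views of it.
    void* host_ptr = nullptr;
    cudaHostAlloc(&host_ptr, width * height, cudaHostAllocMapped);
    void* dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);
    // dev_ptr is what CUDA code writes into (e.g. a small kernel converting the
    // float TensorRT output to U8); host_ptr is what OpenVX sees.

    vx_context context = vxCreateContext();

    // Describe the buffer layout for OpenVX (packed 8-bit rows here, since
    // vxMatchTemplate works on U8 images, so the float feature maps coming out
    // of TensorRT would need a conversion step anyway).
    vx_imagepatch_addressing_t addr;
    addr.dim_x    = width;
    addr.dim_y    = height;
    addr.stride_x = 1;
    addr.stride_y = width;
    addr.scale_x  = VX_SCALE_UNITY;
    addr.scale_y  = VX_SCALE_UNITY;
    addr.step_x   = 1;
    addr.step_y   = 1;

    void* ptrs[] = { host_ptr };
    vx_image image = vxCreateImageFromHandle(context, VX_DF_IMAGE_U8,
                                             &addr, ptrs, VX_MEMORY_TYPE_HOST);

    // ... build and process the graph (template matching, etc.) using 'image' ...

    vxReleaseImage(&image);
    vxReleaseContext(&context);
    cudaFreeHost(host_ptr);
    return 0;
}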

So the question is: is there a correct way to do this? Also, if you have any recommendation concerning the overall problem described above, it would be greatly appreciated.

Thank you,
Regards.


Environment

TensorRT Version : 7.1.3-1
GPU Type : Jetson TX2 (integrated GPU)
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.0.0.1810-1
Operating System + Version : L4T R32
Python Version (if applicable) :
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :

Hi armand.zampierizn4wa,
the better way to resolve it may be to write a plugin for TRT to support your cross-correlation op.
Reference: onnx2trt - Depthwise Cross Correlation - Deep Learning (Training & Inference) / TensorRT - NVIDIA Developer Forums
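For reference, here is a minimal, untested sketch of the depthwise cross-correlation such a plugin's enqueue() would launch (the IPluginV2DynamicExt boilerplate is omitted; all names are illustrative, and the shapes follow the rpn_head example: N=3, C=256, a 7x7 kernel over 31x31 search features):

#include <cuda_runtime.h>

// out[n][c][i][j] = sum over (u,v) of search[n][c][i+u][j+v] * kernel[n][c][u][v]
// i.e. each channel of the search features is correlated with the matching
// channel of the template features ("xcorr_depthwise" in SiamRPN++).
__global__ void depthwise_xcorr_kernel(const float* __restrict__ search,   // [N, C, Hs, Ws]
                                       const float* __restrict__ kernel,   // [N, C, Hk, Wk]
                                       float* __restrict__ out,            // [N, C, Ho, Wo]
                                       int N, int C, int Hs, int Ws,
                                       int Hk, int Wk, int Ho, int Wo)
{
    // One thread per output element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N * C * Ho * Wo) return;

    int wo = idx % Wo;
    int ho = (idx / Wo) % Ho;
    int c  = (idx / (Wo * Ho)) % C;
    int n  = idx / (Wo * Ho * C);

    const float* s = search + ((n * C + c) * Hs) * Ws;   // start of this channel
    const float* k = kernel + ((n * C + c) * Hk) * Wk;

    float acc = 0.f;
    for (int u = 0; u < Hk; ++u)
        for (int v = 0; v < Wk; ++v)
            acc += s[(ho + u) * Ws + (wo + v)] * k[u * Wk + v];

    out[idx] = acc;
}

// Host-side launcher; 'stream' can be the stream TensorRT enqueues on, so no
// extra synchronization is needed inside the plugin.
void launch_depthwise_xcorr(const float* d_search, const float* d_kernel, float* d_out,
                            int N, int C, int Hs, int Ws, int Hk, int Wk,
                            cudaStream_t stream)
{
    int Ho = Hs - Hk + 1;   // "valid" correlation, no padding (31 - 7 + 1 = 25)
    int Wo = Ws - Wk + 1;
    int total = N * C * Ho * Wo;
    int block = 256;
    int grid  = (total + block - 1) / block;
    depthwise_xcorr_kernel<<<grid, block, 0, stream>>>(d_search, d_kernel, d_out,
                                                       N, C, Hs, Ws, Hk, Wk, Ho, Wo);
}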

Hi Jeffli,

Thank you for your answer,
Will it work though? The original layer that performs the problematic correlation is a conv2d in PyTorch, and a plain Conv in the ONNX conversion, which should be supported by TensorRT according to Support Matrix :: NVIDIA Deep Learning TensorRT Documentation
Before engaging in costly and risky development I would like to be sure that the custom layer will behave better than the original one. Why is the operation not supported when the convolution is applied with a dynamic kernel?
If you want to reproduce the problem, you can find attached the head of the ONNX model that I am trying to convert using trtexec.
rpn_head_2.onnx (28.6 MB)

The exact command line used to do so is:
trtexec --onnx=<…>/rpn_head_2.onnx --saveEngine=<…>/rpn_head_2.engine --shapes=input_1:3x1x256x7x7,input_2:3x1x256x31x31 --verbose

I tried with part of the model and with the full model, with the same results: the TensorRT converter fails and counts 0 weights for this kernel (which partially makes sense, since the weights come from the second input).

Thank you for your answer

By the way, this issue is almost identical to the problem mentioned here:

but does not have a solution yet. The cross-correlation layer is at the core of the network and cannot be efficiently replaced.

Hi armand.zampierizn4wa,
I reproduced the issue with the model you provided; the error is:
TensorRT only supports multi-input conv for explicit precision QAT networks
Similar issues have been reported: they are caused by a Conv2d with two inputs, which TRT does not support yet.

Conv2d with multiple inputs is only supported by TRT for quantized networks.
Another discussion about some workarounds:
https://githubmemory.com/repo/onnx/onnx-tensorrt/issues/609
It seems this is NOT so easy to resolve if you ONLY have the ONNX model;
if you have the source code, try to replace the Conv2d with some other op to debug this.

Hello,

As of now I could not solve this issue. Unfortunately quantization (using the ONNX quantizer) did not work and raised other issues (the same problem appeared with quantization-aware training, in which some layers (QuantizeLinear) also depend on a dynamic input (y_scale); this might be solved with static quantization, but I have not had time to investigate further).
In addition, the ONNX quantization that replaces Conv2d with ConvInteger is likely to produce the same error.
Modifying the PyTorch source (GitHub - PengBoXiangShang/SiamRPN_plus_plus_PyTorch: SiamRPN, SiamRPN++, unofficial implementation of "SiamRPN++" (CVPR2019), multi-GPUs, LMDB.) does not seem like a solution to me, since convolving the features of the target image over the source image is at the core of this network and of most single-object-tracking networks (except GOTURN, for example). To be more specific, I do not see any equivalent operation that does not require a dynamic kernel; GOTURN uses fully connected layers as the head of the network, which could in fact replace the convolution, but the price is adding a huge amount of useless operations in the process (you could in theory replace every convolution with fully connected layers, but the performance would be terrible).
Concerning the workaround, it is not suitable in our case since it assumes a 1x1 kernel, which in our case (image correlation) would add noise and greatly degrade the tracker's performance.
The impossibility of applying this operation is a big surprise and particularly disappointing. This kind of operation is very basic in computer vision and was in use well before the neural network "boom".

From what I saw, only this article:
http://wintics.com/fr/building-smart-camera-applications-at-an-industrial-scale-by-leveraging-cutting-edge-deep-learning-techniques-2/
claims to have found a workaround, but it is not detailed and probably proprietary.

If you face this issue, I would recommend switching to detection-based tracking or a GOTURN architecture for now, until TensorRT support is added. It is also possible to implement the missing layers outside TensorRT (see the sketch below) or to switch frameworks mid-pipeline, but the latter also seems time consuming and will hurt inference time (I tried with ONNX Runtime with poor performance and opened a ticket there to understand the problem: onnxruntime Jetson tx2 cuda · Issue #8771 · microsoft/onnxruntime · GitHub).
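As an illustration of the "implement the missing layer outside TensorRT" route, a rough, untested outline of the host-side glue is given below; the binding indices, shapes and the launch_depthwise_xcorr helper (the kernel sketched earlier in this thread) are all hypothetical and would need to be adapted to the real split engines.

#include <NvInfer.h>
#include <cuda_runtime.h>

// Launcher from the depthwise cross-correlation kernel sketched earlier.
void launch_depthwise_xcorr(const float* search, const float* kernel, float* out,
                            int N, int C, int Hs, int Ws, int Hk, int Wk,
                            cudaStream_t stream);

void run_split_inference(nvinfer1::IExecutionContext* backbone,   // engine cropped before the xcorr
                         nvinfer1::IExecutionContext* head,       // engine cropped after the xcorr
                         void** backboneBindings,                 // pre-allocated device buffers
                         void** headBindings,
                         cudaStream_t stream)
{
    // 1. Everything up to the cross-correlation runs inside TensorRT.
    backbone->enqueueV2(backboneBindings, stream, nullptr);

    // 2. The unsupported layer runs as a hand-written kernel directly on the
    //    TensorRT output buffers; everything stays in device memory, no copies.
    //    Binding indices below are placeholders -- use getBindingIndex() on the
    //    real engines. Shapes follow the rpn_head example (3x256, 7x7 over 31x31).
    const float* searchFeat = static_cast<const float*>(backboneBindings[1]);
    const float* kernelFeat = static_cast<const float*>(backboneBindings[2]);
    float*       xcorrOut   = static_cast<float*>(headBindings[0]);
    launch_depthwise_xcorr(searchFeat, kernelFeat, xcorrOut,
                           3, 256, 31, 31, 7, 7, stream);

    // 3. The rest of the head runs inside TensorRT again.
    head->enqueueV2(headBindings, stream, nullptr);

    cudaStreamSynchronize(stream);
}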

If you find any other path worth exploring, don't hesitate to mention it, but as stated the issue is not trivial.
Hope this helps, and good luck.