Openvx/Visionworks graph input from GPU memory Buffer

armand.zampierizn4wa · July 8, 2021, 12:41pm

Description

Hello,

I am facing a problem running SiamRPN++ inference with TensorRT. The problem affects many similar network architectures and is, from what I have learned, due to TensorRT not supporting cross-correlation between two dynamic inputs (TensorRT seems to require a static kernel for this kind of operation).
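
For readers unfamiliar with the operation, here is a minimal pure-Python sketch of a depthwise cross-correlation in which the kernel is itself a runtime input rather than a stored weight. The function name `xcorr_depthwise` and the toy shapes are assumptions made for illustration; this is not the SiamRPN++ implementation, only the arithmetic it relies on.

```python
def xcorr_depthwise(search, kernel):
    """Valid (no padding, stride 1) cross-correlation, channel by channel.

    search: list of C channels, each an Hs x Ws list of lists
    kernel: list of C channels, each an Hk x Wk list of lists
            (a runtime input here, not a trained weight)
    """
    assert len(search) == len(kernel), "one kernel channel per search channel"
    out = []
    for s_ch, k_ch in zip(search, kernel):
        hs, ws = len(s_ch), len(s_ch[0])
        hk, wk = len(k_ch), len(k_ch[0])
        ho, wo = hs - hk + 1, ws - wk + 1  # output size of a valid correlation
        out.append([
            [sum(s_ch[i + u][j + v] * k_ch[u][v]
                 for u in range(hk) for v in range(wk))
             for j in range(wo)]
            for i in range(ho)
        ])
    return out


# Toy data: 2 channels, 5x5 search features, 3x3 kernels (hypothetical values).
search = [[[float(r * 5 + c) for c in range(5)] for r in range(5)] for _ in range(2)]
kernel = [
    [[1.0] * 3 for _ in range(3)],                          # sums each 3x3 window
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],    # picks the window centre
]
response = xcorr_depthwise(search, kernel)
print(len(response), len(response[0]), len(response[0][0]))
```

Both arguments change on every call, which is the situation the converter rejects: there is no constant weight tensor to bake into the engine.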

The solutions I have found so far are to cut the network just before the cross-correlation and to perform that operation either manually or with another framework.

One functional solution was to couple TensorRT with a second inference engine (onnxruntime) that supports this operation. However, for unknown reasons, performance was terrible with onnxruntime: more than 500 ms for a single inference, versus around 60 ms for the whole network on the PC-based version. The hardware difference explains part of the gap, but the gap is not consistent with what I measured on ResNet-50 for comparison.

Another solution seems to be reimplementing the operation in CUDA, which looks particularly time-consuming and not portable.
The last solution, which I am exploring, is to use OpenVX/VisionWorks and the vxMatchTemplate node to implement the operation.

As part of this last solution, I am trying to point a VisionWorks image at an already allocated GPU buffer (the output of TensorRT), but I couldn't find how to do so.
vxCreateImageFromHandle looks like a good starting point, but the memory type parameter only offers Host and None, which doesn't seem to correspond to GPU memory (although the memory is shared, I don't think the pointers are interchangeable this way).

So the question is: is there a correct way to do this? Also, if you have any recommendation concerning the overall problem described above, it would be greatly appreciated.

Thank you,
Regards.


Environment

TensorRT Version : 7.1.3-1
GPU Type : Jetson TX2 GPU
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.0.0.1810-1
Operating System + Version : L4T R32
Python Version (if applicable) :
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :

Jeffli · July 12, 2021, 3:53am

Hi armand.zampierizn4wa,
A better way to resolve this may be to write a TensorRT plugin that supports your cross-correlation op.
Reference: onnx2trt - Depthwise Cross Correlation - Deep Learning (Training & Inference) / TensorRT - NVIDIA Developer Forums

armand.zampierizn4wa · July 12, 2021, 4:00pm

Hi Jeffli,

Thank you for your answer.
Will it work, though? The original layer that performs the problematic correlation is a conv2d in PyTorch and a plain Conv in the ONNX conversion, which should be supported by TensorRT according to Support Matrix :: NVIDIA Deep Learning TensorRT Documentation.
Before engaging in costly and risky development, I would like to be sure that the custom layer will behave better than the original one. Why is the operation not supported when the convolution uses a dynamic kernel?
If you want to reproduce the problem, here is the head of the ONNX model that I am trying to convert using trtexec:
rpn_head_2.onnx (28.6 MB)

The exact command line used to do so is:
trtexec --onnx=<…>/rpn_head_2.onnx --saveEngine=<…>/rpn_head_2.engine --shapes=input_1:3x1x256x7x7,input_2:3x1x256x31x31 --verbose

I tried with part of the model and with the full model, with the same result: the TensorRT converter fails, counting 0 weights for this kernel (which partially makes sense, since the weights come from the second input).

Thank you for your answer

By the way, this issue is almost identical to the problem mentioned here:

but that one does not have a solution yet. The cross-correlation layer is at the core of the network and cannot be efficiently replaced.

Jeffli · July 27, 2021, 10:06am

Hi armand.zampierizn4wa,
I reproduced the issue with the model you provided. The error is:
TensorRT only supports multi-input conv for explicit precision QAT networks
A similar issue is reported here. It is caused by a conv2d with two inputs, which TensorRT does not support at the moment:

github.com/onnx/onnx-tensorrt

[8] Assertion failed: ctx->network()->hasExplicitPrecision() && "TensorRT only supports multi-input conv for explicit precision QAT networks!"

Opened 19 Feb 2021 by jinfagang, closed 21 Mar 2022. Labels: duplicate, triaged.

Hi, I am trying to convert an ONNX model with the normal float32 data type, not QAT models, but it gives me this error message:

```
[8] Assertion failed: ctx->network()->hasExplicitPrecision() && "TensorRT only supports multi-input conv for explicit precision QAT networks!"
```

And I can reproduce this error with this minimal code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MG(nn.Module):
    def __init__(self):
        super().__init__()

    # for test if torch.cat([bool, bool]) can convert
    def forward(self, x, b):
        # x, b = x
        # b is a runtime input used as the convolution kernel
        preds = F.conv2d(x, b, stride=1)
        preds = preds.to(torch.float)
        preds = preds.sigmoid().float()
        seg_masks = preds > torch.tensor(0.03, dtype=torch.float)
        return seg_masks


torch_model = MG()
x = torch.randn([1, 4, 24, 24])
b = torch.randn([8, 4, 3, 3])
torch_out = torch_model(x, b)

# Export the model
torch.onnx.export(torch_model,               # model being run
                  (x, b),
                  "a.onnx",
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=11,          # the ONNX version to export the model to
                  do_constant_folding=True,
                  verbose=True)
print('Done!')
```

If you export the ONNX model with PyTorch 1.7 and try to convert it to a TRT engine, it shows this error:

```
[8] Assertion failed: ctx->network()->hasExplicitPrecision() && "TensorRT only supports multi-input conv for explicit precision QAT networks!"
```

You might ask why `torch.tensor(0.03, dtype=torch.float)` is used in the `>` op. It is **because otherwise it will cast float to double and introduce a double data type in the ONNX model**, which makes onnx2trt raise another error, `unsupported datatype 11`. So how should we solve this awkward situation?

conv2d with multiple inputs is only supported by TensorRT for quantized networks.
Another discussion about some workarounds:
https://githubmemory.com/repo/onnx/onnx-tensorrt/issues/609
This does not seem easy to resolve if you ONLY have the ONNX model.
If you have the source code, try replacing conv2d with some other op to debug this.
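
As an illustration of what "replacing conv2d with some other op" could mean, the sketch below rewrites a valid cross-correlation as patch extraction (im2col) followed by a matrix-vector product, and checks it against the direct sliding-window form. It is pure Python on a single channel; the helper names are hypothetical, and whether the equivalent ONNX graph converts and runs acceptably on TensorRT 7.1 is an assumption that was not tested in this thread.

```python
def im2col(image, hk, wk):
    """Unfold an H x W image into rows: one flattened hk x wk patch per output position."""
    h, w = len(image), len(image[0])
    return [[image[i + u][j + v] for u in range(hk) for v in range(wk)]
            for i in range(h - hk + 1) for j in range(w - wk + 1)]


def xcorr_via_matmul(image, kernel):
    """Cross-correlation expressed as (patch matrix) x (flattened kernel)."""
    hk, wk = len(kernel), len(kernel[0])
    wo = len(image[0]) - wk + 1
    flat_kernel = [x for row in kernel for x in row]
    flat = [sum(a * b for a, b in zip(patch, flat_kernel))
            for patch in im2col(image, hk, wk)]
    # Reshape the flat result back into rows of width wo.
    return [flat[r:r + wo] for r in range(0, len(flat), wo)]


def xcorr_direct(image, kernel):
    """Reference sliding-window cross-correlation (valid, stride 1)."""
    hk, wk = len(kernel), len(kernel[0])
    ho, wo = len(image) - hk + 1, len(image[0]) - wk + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(hk) for v in range(wk))
             for j in range(wo)] for i in range(ho)]


# Toy 4x4 image and 2x2 kernel (hypothetical values).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
via_matmul = xcorr_via_matmul(image, kernel)
direct = xcorr_direct(image, kernel)
print(via_matmul == direct)
```

The trade-off is memory: the patch matrix repeats every input value once per kernel position that covers it, so it grows with the kernel area.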

armand.zampierizn4wa · August 25, 2021, 2:02pm

Hello,

As of now I couldn’t solve this issue. Unfortunately, quantization (using the ONNX quantizer) didn’t work and raised other issues. The same problem appeared with quantization-aware training, in which some layers (QuantizeLinear) also depend on a dynamic input (y_scale). This may be solved by using static quantization, but I haven’t had the time to investigate further.
In addition, the ONNX quantization replaces conv2d with ConvInteger, which is likely to produce the same error.
Modifying the PyTorch source (GitHub - PengBoXiangShang/SiamRPN_plus_plus_PyTorch: SiamRPN, SiamRPN++, unofficial implementation of "SiamRPN++" (CVPR2019), multi-GPUs, LMDB.) doesn’t look like a solution to me, since the convolution of the target-image features over the source image is at the core of this network and of most single-object tracking networks (GOTURN being an exception). To be more specific, I don’t see any equivalent operation that doesn’t require a dynamic kernel. GOTURN uses fully connected layers as the head of the network, which could in fact replace the convolution, but at the price of adding a huge number of useless operations (in theory you could replace every convolution with fully connected layers, but the performance would be terrible).
As for the workaround, it doesn’t fit our case since it assumes a 1x1 kernel, which in our case (image correlation) would introduce noise and greatly degrade the tracker’s performance.
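
To make the 1x1 limitation concrete, here is a small pure-Python sketch. It is not the linked workaround itself, only an illustration (with the hypothetical helper `xcorr_1x1` and toy values) of why a 1x1 kernel cannot encode spatial layout: each response depends on a single pixel position, so rearranging the pixels merely rearranges the responses.

```python
def xcorr_1x1(search, kernel):
    """search: C x H x W nested lists; kernel: C scalars (one 1x1 kernel per channel).

    The response at (i, j) is a weighted sum over channels of that one pixel.
    """
    h, w = len(search[0]), len(search[0][0])
    return [[sum(k * ch[i][j] for k, ch in zip(kernel, search))
             for j in range(w)] for i in range(h)]


pattern = [[[1.0, 2.0], [3.0, 4.0]]]    # one channel, hypothetical values
shuffled = [[[4.0, 1.0], [2.0, 3.0]]]   # same values in a different layout
kernel = [0.5]

resp_a = xcorr_1x1(pattern, kernel)
resp_b = xcorr_1x1(shuffled, kernel)

# The two response maps contain exactly the same values, only moved around,
# so no response is specific to the spatial arrangement of the pattern.
same_values = (sorted(v for row in resp_a for v in row)
               == sorted(v for row in resp_b for v in row))
print(same_values)
```

A 7x7 template, by contrast, produces a high response only where the neighbourhood matches the template's layout, which is what the tracker relies on.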
The impossibility of applying this operation is a huge surprise and particularly disappointing. This kind of operation is very basic in computer vision and was in use well before the neural network “boom”.

From what I saw, only this article:
http://wintics.com/fr/building-smart-camera-applications-at-an-industrial-scale-by-leveraging-cutting-edge-deep-learning-techniques-2/
claims to have found a workaround, but it is not detailed and is probably proprietary.

If you are facing this issue, I would recommend switching to detection-based tracking or a GOTURN architecture for now, until TensorRT support is added. It is also possible to implement the missing layers outside TensorRT, or to switch frameworks midway, but this also seems time-consuming and will reduce inference speed. (I tried with onnxruntime with poor performance and opened a ticket there to understand the problem: onnxruntime Jetson tx2 cuda · Issue #8771 · microsoft/onnxruntime · GitHub)

If you find any avenue to explore, don’t hesitate to mention it, but as stated the issue is not trivial.
Hope this helps, and good luck.
