Multiplying one tensor by another leads to a dimension mismatch at a Concat op when converting to TensorRT

Description

Hello Guys,

I am having an issue converting an ONNX model to TensorRT.
In the PyTorch script, I multiply a tensor of shape torch.Size([1, 3, 52, 52, 2]) by one of shape torch.Size([1, 3, 1, 1, 2]):

twofour = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]
y[..., 2:4] has the shape torch.Size([1, 3, 52, 52, 2])
self.anchor_grid[i] has the shape torch.Size([1, 3, 1, 1, 2])

The result in the PyTorch code is a tensor of shape torch.Size([1, 3, 52, 52, 2]), the same as y[..., 2:4].
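The shape rule behind this result can be sketched in plain Python. This is only an illustration of the usual broadcasting rule (align shapes from the trailing dimension, treat missing or size-1 dimensions as stretchable); `broadcast_shape` is a hypothetical helper written for this post, not a PyTorch or TensorRT function.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; a missing leading
    # dimension is treated as size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        # A size-1 dimension stretches to match the other operand.
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


y_slice = (1, 3, 52, 52, 2)   # shape of y[..., 2:4]
anchor = (1, 3, 1, 1, 2)      # shape of self.anchor_grid[i]
print(broadcast_shape(y_slice, anchor))  # (1, 3, 52, 52, 2)
```

Under this rule, two 5-D operands always give a 5-D result, which is what PyTorch reports.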

The conversion to ONNX goes well, but the conversion to TensorRT triggers the following error:
    [2021-04-30 17:30:48   ERROR] Concat_218: all concat input tensors must have the same number of dimensions, but mismatch at input 1. Input 0 shape: [-1,-1,52,52,2], Input 1 shape: [-1,-1,3,-1,-1,2]
    While parsing node number 219 [Concat -> "384"]:
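Note that the log complains about the number of dimensions, not about their sizes: Input 0 has five dimensions and Input 1 has six. A minimal sketch of that kind of rank check, using the two shapes copied from the log, is shown here. `first_rank_mismatch` is a hypothetical helper for illustration and is not how the TensorRT parser is actually implemented.

```python
def first_rank_mismatch(shapes):
    """Return the index of the first input whose rank differs from input 0, or None."""
    expected_rank = len(shapes[0])
    for index, shape in enumerate(shapes[1:], start=1):
        if len(shape) != expected_rank:
            return index
    return None


# Shapes as printed in the error message (-1 marks a dynamic dimension).
concat_inputs = [
    [-1, -1, 52, 52, 2],      # Input 0: rank 5
    [-1, -1, 3, -1, -1, 2],   # Input 1: rank 6
]
print(first_rank_mismatch(concat_inputs))  # 1
```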

I have attached an image of the ONNX model.

I don't understand how this error happens. The multiplication is element-wise, so the result should have the same shape as the larger operand.

Here is a more complete snippet of the code in PyTorch:

onetwo = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
twofour = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]
foursix = y[..., 4:] 

new_y = torch.cat([twofour, twofour, foursix], dim=4)

I don't understand why the element-wise multiplication is not handled correctly.

If I replace self.anchor_grid[i] with a float value, the conversion works.
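One possible reading of this, offered as a hypothesis rather than a diagnosis: the rank of a broadcast result is the largest rank among its operands. A scalar has rank 0 and can never raise the rank, whereas an operand that reaches the multiplication with six dimensions would. The sketch below only does this rank arithmetic; `broadcast_rank` is a hypothetical helper, and the 6-D shape is an assumed example, not something read from the model.

```python
def broadcast_rank(*shapes):
    """Rank of a broadcast result: the largest rank among the operands."""
    return max(len(shape) for shape in shapes)


y_slice = (1, 3, 52, 52, 2)         # shape of y[..., 2:4], rank 5
scalar = ()                          # a Python float behaves like a rank-0 operand
anchor_5d = (1, 3, 1, 1, 2)          # shape reported by PyTorch for self.anchor_grid[i]
anchor_6d = (1, 1, 3, 1, 1, 2)       # ASSUMED shape, only to illustrate a rank-6 operand

print(broadcast_rank(y_slice, scalar))     # 5
print(broadcast_rank(y_slice, anchor_5d))  # 5
print(broadcast_rank(y_slice, anchor_6d))  # 6
```

If the exported graph really does present the anchor operand with six dimensions, comparing its shape in the ONNX model against the 5-D shape PyTorch prints would be one way to confirm it.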

Environment

TensorRT Version: 7.0.0
GPU Type: NVIDIA V100
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.9
PyTorch Version (if applicable): 1.8
ONNX Conversion done with: 1.8
ONNX IR version: 0.0.6
Opset version: 12
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:20.01-py3

Relevant Files

Steps To Reproduce

When converting with ./onnx2trt the mentioned issue occurs.

Thanks in advance,

Regards

Hi,
The links below might be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.

Thanks!