Description
I am using this repository as a reference to add output tensor parsing and clustering for a YOLO segmentation model with ONNX GraphSurgeon: GitHub - marcoslucianops/DeepStream-Yolo-Seg: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO-Segmentation models. I am now seeing large latencies on a call after the NonMaxSuppression layer, named nms_layer_output(1)[DeviceToShapeHostCopy], plus an additional synchronization with the host.
As stated in https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6 the INMSLayer appears to synchronize with the host, presumably because of its dynamic (data-dependent) output size.
Is there a way to fix the output dimensions of the INMSLayer so that it runs without this synchronization/copy to the host?
Which setting is causing this? A missing TopK, or an incorrect maxOutputBoxes?
https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/classnvinfer1_1_1_i_n_m_s_layer.html
Environment
TensorRT Version : 10.3
GPU Type : 4070 Ti Laptop
Nvidia Driver Version : 575
CUDA Version : 12.6
CUDNN Version : 9.3
Operating System + Version : Ubuntu 24.04 LTS
Python Version : 3.10
PyTorch Version : 2.6.0
Container : nvcr.io/nvidia/deepstream:7.1-triton-multiarch
I tried to create an app that adds the INMSLayer and RoiAlign myself to have more control, but I couldn't find any way to stop TensorRT from adding the two additional layers.
After looking a bit more, it seems this has been an issue for a long time, and NVIDIA has never answered the topic:
GitHub issue, opened 10:00 PM, 14 Nov 2024 UTC — labels: Module:Performance, triaged
## Description
NMS layers are much slower on TensorRT than on PyTorch (44% of the performance) and I'm looking for any possible workaround. This seems to be acknowledged as a known issue in the TensorRT release notes [here](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#rel-10-6-0):
> A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations
Is there any possible workaround or a fix planned in a specific future version? I am specifically using these layers inside a `FasterRCNN` network (as implemented in `torchvision` [here](https://pytorch.org/vision/main/models/faster_rcnn.html)). I observe this network to be much slower when running either with a single image or 4 images:
- Single image inference latency: 7.8ms on PyTorch, 13.3ms on TensorRT
- 4 image inference latency: 22.8ms on PyTorch, 53.5ms on TensorRT
When I run this network with per-layer profiling, I see that the `NonMaxSuppression` layers account for 75%+ of the overall inference time. I have verified this on TensorRT 10.0 and 10.6. I have tested using ONNX opset 11 and opset 17.
## Environment
**TensorRT Version**: 10.0, 10.6
**NVIDIA GPU**: GeForce RTX 4090
**NVIDIA Driver Version**: 550.54.15
**CUDA Version**: 12.4
**CUDNN Version**: unsure
Operating System:
Python Version (if applicable): 3.9
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2
Baremetal or Container (if so, version):
## Relevant Files
**Model link**: https://pytorch.org/vision/main/models/faster_rcnn.html
## Steps To Reproduce
1. Export FasterRCNN to ONNX
2. Pass ONNX into `trtexec`
3. Compare `trtexec` output to PyTorch equivalent benchmark
**Commands or scripts**:
**Have you tried [the latest release](https://developer.nvidia.com/tensorrt)?**: Yes I have tried TensorRT 10.6 and 10.0
**Can this model run on other frameworks?** For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): Yes it runs on onnxruntime.
As stated here and in my previous post (Output-tensor-meta: Access RAW model output with batch dimension), I will keep running NMS and RoiAlign outside the model in a custom postprocessor until someone feels obliged to answer this issue.
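For reference, a host-side postprocessor of the kind I mean boils down to greedy NMS over the raw box/score output. This is only a minimal NumPy sketch (function name and the corner-coordinate box layout are my assumptions, not DeepStream API), but it shows that the logic is small enough to live outside the engine:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray,
        iou_thres: float = 0.45, max_det: int = 100) -> np.ndarray:
    """Greedy NMS on the host over raw model output.

    boxes: [N, 4] as (x1, y1, x2, y2); scores: [N].
    Returns indices of the kept boxes, highest score first.
    """
    order = np.argsort(-scores)  # descending by score
    keep = []
    while order.size > 0 and len(keep) < max_det:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the current box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[rest, 2] - boxes[rest, 0]) *
                  (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes that do not overlap the current box too much.
        order = rest[iou <= iou_thres]
    return np.array(keep, dtype=np.int64)
```

Running this (or a CUDA equivalent) after `context.execute_*` sidesteps the INMSLayer entirely, at the cost of doing the suppression outside the engine.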
@Fiona.Chen can you please make sure someone responsible for the TensorRT forums has a look at this? Support on the DeepStream forum is great, but here it seems to be a wasteland…
maybe @fanzh can help out?