TensorRT built-in NMS output lost when using Triton dynamic batching

Description

Hi,

I’m using the TensorRT Python API to build a YOLOv8 engine with the built-in NMS layer and dynamic batch support. NMS output1 is the number of detections, and I added some slice layers on output0 to get the batch indices, boxes, scores and classes. The engine runs fine on its own, and I can handle the outputs easily.
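
For reference, this is roughly how the NMS head is wired up. A simplified sketch, assuming the built-in INMSLayer added via network.add_nms; the decoded head tensors and the downstream slice/gather layers are stand-ins for my actual code:

```python
import numpy as np
import tensorrt as trt

def add_nms_head(network, boxes, scores, max_det_per_class=100):
    """Attach the built-in NMS layer to the decoded YOLOv8 head.

    boxes:  [batch, num_anchors, 4]            (x1, y1, x2, y2)
    scores: [batch, num_classes, num_anchors]
    """
    # max_output_boxes_per_class has to be a 0-D int32 constant tensor.
    max_out = network.add_constant(
        trt.Dims([]), np.array([max_det_per_class], dtype=np.int32)
    ).get_output(0)

    nms = network.add_nms(boxes, scores, max_out)
    nms.bounding_box_format = trt.BoundingBoxFormat.CORNER_PAIRS

    # output 0: SelectedIndices, shape [num_selected_total, 3],
    #           rows are (batch index, class index, box index), flattened over the batch
    # output 1: NumOutputBoxes, a scalar count over the whole batch
    selected = nms.get_output(0)
    num_dets = nms.get_output(1)
    selected.name = "selected_indices"
    num_dets.name = "num_detections"
    network.mark_output(selected)
    network.mark_output(num_dets)

    # The slice/gather layers that split out batch indices, boxes, scores and
    # classes are omitted; they keep the same flattened first dimension.
    return selected, num_dets
```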

But when I deploy it to Triton Inference Server and enable dynamic batching, the inference output becomes all zeros. From what I understand, the built-in NMS output tensors are flattened across the batch, so the first dimension of the outputs no longer matches the first dimension of the input. Since Triton requires every model output to keep the batch dimension so it can split responses back per request, this flattening seems to cause a misalignment when multiple requests are merged dynamically.
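
To make the mismatch concrete, here is an illustration (the detection counts are made up) of what the raw engine outputs look like when two single-image requests get merged into one batch of 2:

```python
# Hypothetical shapes for a merged batch of 2 (numbers are illustrative only):
#   input  "images"           : [2, 3, 640, 640]
#   output "selected_indices" : [7, 3]   # all detections for both images,
#                                        # rows = (batch, class, box index)
#   output "num_detections"   : []       # scalar 7, counted over the whole batch
#
# Neither output has a leading dimension of 2, so Triton cannot scatter the
# results back to the two original requests.
```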

Questions:

  1. Can I somehow handle or reshape the flattened NMS outputs so that they work with Triton dynamic batching?
  2. Or is the built-in NMS simply incompatible with Triton dynamic batching, meaning I have to handle batching myself on the client side before sending the inference request to Triton (roughly like the sketch below)?
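
What I mean by handling the batching myself in question 2 is roughly this client-side sketch (the model and tensor names "yolov8_trt", "images", "selected_indices" and "num_detections" are placeholders for my actual config):

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

def infer_batch(images):
    """images: float32 array of shape [N, 3, H, W], batched on the client side."""
    inp = grpcclient.InferInput("images", list(images.shape), "FP32")
    inp.set_data_from_numpy(images)
    outs = [grpcclient.InferRequestedOutput("selected_indices"),
            grpcclient.InferRequestedOutput("num_detections")]
    res = client.infer(model_name="yolov8_trt", inputs=[inp], outputs=outs)

    selected = res.as_numpy("selected_indices")      # [num_total_detections, 3]
    # Undo the flattening on the client: split rows per image by the batch-index column.
    return [selected[selected[:, 0] == b] for b in range(images.shape[0])]
```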

TensorRT Version: 10.3.0
GPU Type: T4
Nvidia Driver Version: 550
CUDA Version: 12.4
CUDNN Version: 9.6.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version: 3.10.12