Poor performance due to INMSLayers followup nms_layer_output(1)[DevicetoShapeHostCopy] and trainstation

After looking into this a bit more, it seems this has been an issue for a long time, and there has never been an answer on this topic from NVIDIA.

As stated here and in my previous post, "Output-tensor-meta Access RAW model output with batch dimension", I will keep running NMS and RoiAlign outside the model in a custom postprocessor until someone feels obligated to give an answer to this issue.
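
For anyone wanting to try the same workaround, here is a minimal sketch of what host-side NMS in a postprocessor can look like. It is plain greedy NMS in pure Python, not my actual postprocessor and not a TensorRT or DeepStream API. The function names (`iou`, `nms`), the `[x1, y1, x2, y2]` box layout, and the default thresholds are assumptions for illustration only.

```python
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as [x1, y1, x2, y2] (assumed layout)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(
    boxes: Sequence[Sequence[float]],
    scores: Sequence[float],
    iou_threshold: float = 0.5,    # hypothetical default
    score_threshold: float = 0.0,  # hypothetical default
) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Drop low-score candidates, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

This is O(n²) in the number of candidates and runs on the CPU, so it only makes sense after filtering by score. In a real pipeline a vectorized or GPU implementation would likely be needed.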