Output-tensor-meta: access the raw model output with the batch dimension

I'm not using the custom parsing function (the one configurable in the config file) because from inside it I could not find a way to get the device buffers of NvDsInferTensorMeta, or any other GPU/device buffer; only host buffers are provided.
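For reference, this is roughly the parser hook I mean. The struct definitions below are trimmed stand-ins for the real `nvdsinfer.h` types so the sketch compiles on its own (the actual header has more fields and parameters); the point is that `NvDsInferLayerInfo::buffer` is a host pointer, with no device pointer alongside it:

```cpp
#include <cassert>
#include <vector>

// Trimmed stand-ins for the nvdsinfer.h types (illustration only;
// the real headers define many more fields and a richer signature).
struct NvDsInferLayerInfo {
  const char *layerName;
  void *buffer;             // host memory only; no device pointer is exposed here
  unsigned int numElements;
};
struct NvDsInferObjectDetectionInfo {
  float left, top, width, height, detectionConfidence;
};

// Shape of a custom bbox parser as wired up via parse-bbox-func-name:
// everything it receives already lives on the host, so GPU post-processing
// would first require copying the tensors back to the device.
extern "C" bool NvDsInferParseCustomSketch(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    std::vector<NvDsInferObjectDetectionInfo> &objectList) {
  for (auto const &layer : outputLayersInfo) {
    const float *host = static_cast<const float *>(layer.buffer);
    // ... decode boxes on the CPU from `host` ...
    (void)host;
  }
  (void)objectList;
  return true;
}
```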
I want to do post-processing directly on the GPU, using batched kernels for non-maximum suppression and ROI alignment for mask creation. I tried modifying the model's output head instead, but performance was poor because TensorRT's INMSLayer synchronizes with the host on every call. I couldn't find a solution for that either, and the forum topic has received no answers:
https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6

And since my kernels need relatively large processing buffers, I don't want to allocate them on every single call, but once at initialization, sized for the maximum expected batch.
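What I have in mind is an allocate-once workspace sized for the maximum batch. In this self-contained sketch, plain `new[]`/`delete[]` stand in for `cudaMalloc`/`cudaFree`; the names and sizing scheme are my own, not from any DeepStream API:

```cpp
#include <cstddef>
#include <stdexcept>

// Allocate-once workspace for post-processing buffers.
// new[]/delete[] stand in for cudaMalloc/cudaFree in this sketch.
class Workspace {
 public:
  Workspace(std::size_t maxBatch, std::size_t bytesPerItem)
      : capacity_(maxBatch * bytesPerItem),
        buf_(new unsigned char[maxBatch * bytesPerItem]) {}
  ~Workspace() { delete[] buf_; }
  Workspace(const Workspace &) = delete;
  Workspace &operator=(const Workspace &) = delete;

  // Hand out the preallocated buffer for the current batch;
  // no per-call allocation happens here.
  unsigned char *get(std::size_t batch, std::size_t bytesPerItem) {
    if (batch * bytesPerItem > capacity_)
      throw std::runtime_error("batch exceeds preallocated capacity");
    return buf_;
  }

 private:
  std::size_t capacity_;
  unsigned char *buf_;
};
```

The workspace is built once at initialization and every inference call just reuses it, which is exactly what I can't do if the buffers have to live inside a per-call parsing function.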

I also don't understand some of the design decisions here: why is the originally contiguous output buffer split up per frame? Or is this due to internals of how TensorRT handles batched inference?