Custom Autoencoder Model

You can deploy your model with gst-nvinfer and parse its output image into customized frame user meta.

Since your model only needs an RGB/grayscale image of the ROI, the nvvideoconvert plugin can handle both the format conversion to RGB/grayscale and the ROI crop before gst-nvinfer (a sketch follows below). You can then get the ROI image directly from the GstBuffer that gst-nvinfer outputs.
There is sample code showing how to get the raw data from an NvBufSurface: DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums
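
For reference, here is a minimal sketch (C, GStreamer API) of the conversion and crop step. The ROI coordinates, element names, and RGBA format are assumptions for illustration; "src-crop" is a standard nvvideoconvert property taking "left:top:width:height" in pixels:

```c
#include <gst/gst.h>

/* Create the nvvideoconvert + capsfilter pair that crops an assumed ROI
 * (left=100, top=50, 640x480) and converts it to RGBA before gst-nvinfer. */
static void
make_roi_elements (GstElement **conv, GstElement **filter)
{
  *conv = gst_element_factory_make ("nvvideoconvert", "roi-convert");
  g_object_set (G_OBJECT (*conv), "src-crop", "100:50:640:480", NULL);

  /* Pin the output format; use GRAY8 instead of RGBA for a grayscale model. */
  *filter = gst_element_factory_make ("capsfilter", "roi-caps");
  GstCaps *caps =
      gst_caps_from_string ("video/x-raw(memory:NVMM), format=RGBA");
  g_object_set (G_OBJECT (*filter), "caps", caps, NULL);
  gst_caps_unref (caps);
}
```

And a sketch of reading the raw pixels back out of the NvBufSurface in a pad probe, along the lines of the FAQ sample (note that on dGPU this requires the buffers to be in CUDA unified memory):

```c
#include <gst/gst.h>
#include "nvbufsurface.h"

/* Hypothetical pad-probe callback that maps the first surface of the
 * batch for CPU access and reads its pixels. */
static GstPadProbeReturn
buffer_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  GstMapInfo map;

  if (!gst_buffer_map (buf, &map, GST_MAP_READ))
    return GST_PAD_PROBE_OK;

  /* In DeepStream pipelines the GstBuffer wraps an NvBufSurface batch. */
  NvBufSurface *surface = (NvBufSurface *) map.data;

  if (NvBufSurfaceMap (surface, 0, 0, NVBUF_MAP_READ) == 0) {
    NvBufSurfaceSyncForCpu (surface, 0, 0);
    NvBufSurfaceParams *params = &surface->surfaceList[0];
    guint8 *pixels = (guint8 *) params->mappedAddr.addr[0];
    /* pixels now points at the RGBA/GRAY8 ROI image. */
    g_print ("ROI %u x %u, pitch %u, first byte %u\n",
        params->width, params->height, params->pitch, pixels[0]);
    NvBufSurfaceUnMap (surface, 0, 0);
  }

  gst_buffer_unmap (buf, &map);
  return GST_PAD_PROBE_OK;
}
```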

To get the model output image, please set "output-tensor-meta=1" and "network-type=100" in the nvinfer configuration file to enable customized output tensor parsing: Gst-nvinfer — DeepStream 6.4 documentation
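
For example, the relevant entries in the [property] group would look like this; the model file names are placeholders for your own files:

```
[property]
onnx-file=autoencoder.onnx
model-engine-file=autoencoder.onnx_b1_gpu0_fp16.engine
batch-size=1
# 100 = "other": nvinfer skips its built-in post-processing
network-type=100
# Attach the raw output tensors to the buffer as NvDsInferTensorMeta
output-tensor-meta=1
gie-unique-id=1
```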

Then parse the output tensor in your app (see the probe sketch after the link):
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html#tensor-metadata
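
A minimal parsing sketch, assuming full-frame inference (so the tensor meta is attached to the frame user meta list) and an FP32 output layer; the probe would be placed on the nvinfer src pad:

```c
#include <gst/gst.h>
#include "gstnvdsmeta.h"
#include "gstnvdsinfer.h"

/* Walk the batch meta, find the NvDsInferTensorMeta that nvinfer attached
 * because output-tensor-meta=1, and read the raw output tensor. */
static GstPadProbeReturn
tensor_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
  if (!batch_meta)
    return GST_PAD_PROBE_OK;

  for (NvDsMetaList *l_frame = batch_meta->frame_meta_list; l_frame;
       l_frame = l_frame->next) {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) l_frame->data;

    for (NvDsMetaList *l_user = frame_meta->frame_user_meta_list; l_user;
         l_user = l_user->next) {
      NvDsUserMeta *user_meta = (NvDsUserMeta *) l_user->data;
      if (user_meta->base_meta.meta_type != NVDSINFER_TENSOR_OUTPUT_META)
        continue;

      NvDsInferTensorMeta *tensor_meta =
          (NvDsInferTensorMeta *) user_meta->user_meta_data;

      for (guint i = 0; i < tensor_meta->num_output_layers; i++) {
        NvDsInferLayerInfo *layer = &tensor_meta->output_layers_info[i];
        /* Host copy of the raw tensor (assumed FP32 here). */
        float *data = (float *) tensor_meta->out_buf_ptrs_host[i];
        g_print ("layer %s: first value %f\n", layer->layerName, data[0]);
      }
    }
  }
  return GST_PAD_PROBE_OK;
}
```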

There is a customized frame user meta sample at /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-user-metadata-test
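
Following that sample's pattern, attaching your parsed autoencoder output as custom frame user meta looks roughly like this; the meta type string and payload struct are made up for illustration:

```c
#include <string.h>
#include "gstnvdsmeta.h"

#define CUSTOM_META_TYPE (nvds_get_user_meta_type ("MYAPP.AUTOENCODER.OUTPUT"))

/* Illustrative payload carrying the reconstructed image. */
typedef struct {
  guint width;
  guint height;
  guint8 *pixels;   /* owned by this meta */
} AutoencoderOutput;

/* Deep-copy callback, invoked when downstream duplicates the meta. */
static gpointer
copy_func (gpointer data, gpointer user_data)
{
  NvDsUserMeta *user_meta = (NvDsUserMeta *) data;
  AutoencoderOutput *src = (AutoencoderOutput *) user_meta->user_meta_data;
  AutoencoderOutput *dst = g_new0 (AutoencoderOutput, 1);
  *dst = *src;
  dst->pixels = g_malloc ((gsize) src->width * src->height);
  memcpy (dst->pixels, src->pixels, (gsize) src->width * src->height);
  return dst;
}

/* Release callback, invoked when the meta is freed. */
static void
release_func (gpointer data, gpointer user_data)
{
  NvDsUserMeta *user_meta = (NvDsUserMeta *) data;
  AutoencoderOutput *out = (AutoencoderOutput *) user_meta->user_meta_data;
  g_free (out->pixels);
  g_free (out);
  user_meta->user_meta_data = NULL;
}

static void
attach_output (NvDsBatchMeta *batch_meta, NvDsFrameMeta *frame_meta,
               AutoencoderOutput *out)
{
  NvDsUserMeta *user_meta = nvds_acquire_user_meta_from_pool (batch_meta);
  user_meta->user_meta_data = out;
  user_meta->base_meta.meta_type = CUSTOM_META_TYPE;
  user_meta->base_meta.copy_func = copy_func;
  user_meta->base_meta.release_func = release_func;
  nvds_add_user_meta_to_frame (frame_meta, user_meta);
}
```

The copy and release callbacks matter: they let downstream elements (for example tee branches or nvstreamdemux) duplicate and free your meta safely.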