NvBufSurfTransform call is slow when copying GPU surface

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
2070 Super
• DeepStream Version
6.0
• JetPack Version (valid for Jetson only)
• TensorRT Version
8.0.1
• NVIDIA GPU Driver Version (valid for GPU only)
470.82.00
• Issue Type( questions, new requirements, bugs)
Questions
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
gstdsexample’s get_converted_mat
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)
I’m trying to copy an image from the GPU using pretty much the same code as gstdsexample’s get_converted_mat(). However, the call to NvBufSurfTransform is very slow, usually 50-60 ms. I thought this call ran on the GPU?

auto cudaStatus = cudaSetDevice(gpuID);
if (cudaStatus != cudaSuccess) return;
cudaStatus = cudaStreamCreate(&cudaStream);
if (cudaStatus != cudaSuccess) return;

// Zero-initialize so fields not set below (e.g. isContiguous) are not left indeterminate
NvBufSurfaceCreateParams params = {};
params.gpuId = gpuID;
params.width = width;
params.height = height;
params.size = 0;
params.colorFormat = NVBUF_COLOR_FORMAT_RGBA;
params.layout = NVBUF_LAYOUT_PITCH;
params.memType = integrated ? NVBUF_MEM_DEFAULT : NVBUF_MEM_CUDA_PINNED;
if (NvBufSurfaceCreate(&bufferSurface, 1, &params) != 0) return;

NvBufSurfTransformConfigParams transformConfig;
transformConfig.compute_mode = NvBufSurfTransformCompute_Default;
transformConfig.gpu_id = gpuID;
transformConfig.cuda_stream = cudaStream;
NvBufSurfTransform_Error status = NvBufSurfTransformSetSessionParams(&transformConfig);
if (status != NvBufSurfTransformError_Success) return;

NvBufSurfaceMemSet(bufferSurface, 0, 0, 0);

NvBufSurfTransformRect srcRect = {0, 0, width, height};
NvBufSurfTransformRect dstRect = {0, 0, width, height};
// Zero-initialize so fields not set below (e.g. transform_flip) are not left indeterminate
NvBufSurfTransformParams transformParams = {};
transformParams.src_rect = &srcRect;
transformParams.dst_rect = &dstRect;
transformParams.transform_flag = NVBUFSURF_TRANSFORM_FILTER | NVBUFSURF_TRANSFORM_CROP_SRC | NVBUFSURF_TRANSFORM_CROP_DST;
transformParams.transform_filter = NvBufSurfTransformInter_Default;

NvBufSurface tmpSurface = *surface;
tmpSurface.numFilled = 1;
tmpSurface.batchSize = 1;
tmpSurface.surfaceList = &(surface->surfaceList[frameMeta->batch_id]);

// THIS CALL TAKES 50-60 MS
status = NvBufSurfTransform(&tmpSurface, bufferSurface, &transformParams);
if (status != NvBufSurfTransformError_Success) return;

...
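
To narrow down where the 50-60 ms goes, it may help to time the call over several iterations and look at the first call separately from the rest, since a first call can include one-time setup cost. Below is a minimal sketch of such a timing helper in plain C++. The names timeCallsMs and steadyStateMeanMs are hypothetical and are not part of DeepStream or CUDA. It measures wall-clock time on the host only, so it reflects GPU work only to the extent that the timed call waits for it.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a DeepStream API): invokes fn `iterations` times
// and returns the host wall-clock duration of each call in milliseconds.
template <typename Fn>
std::vector<double> timeCallsMs(Fn&& fn, std::size_t iterations) {
    assert(iterations > 0);
    std::vector<double> ms;
    ms.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Mean of every sample except the first one. With fewer than two samples
// there is nothing to skip, so the single sample (or 0.0) is returned.
double steadyStateMeanMs(const std::vector<double>& ms) {
    if (ms.size() < 2) return ms.empty() ? 0.0 : ms.front();
    double sum = 0.0;
    for (std::size_t i = 1; i < ms.size(); ++i) sum += ms[i];
    return sum / static_cast<double>(ms.size() - 1);
}
```

With the snippet above, the transform could be wrapped in a lambda, for example timeCallsMs([&] { status = NvBufSurfTransform(&tmpSurface, bufferSurface, &transformParams); }, 20), and ms[0] compared against steadyStateMeanMs(ms). If only the first sample is slow, the cost is likely one-time setup; if every sample is slow, the cost is in the transform or the copy itself.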