Correctly parallelise preprocessing and inference of a batch with a U-Net model on an RTX 5000 GPU

Hello everyone,
I’m having a runtime problem with my U-Net model because the input image is very large.
I need help reducing this time.

1- The model was saved with Keras in Python (TensorFlow 2.10 was used). The resulting SavedModel consists of 2 folders (“assets”, “variables”) and 2 files (“keras_metadata.pb”, “saved_model.pb”).
2- The model is loaded in C++ with cppflow (c++: model = new cppflow::model(path)) and the inference is performed there (“libtensorflow-gpu-windows-x86_64-2.10.0” is used instead of installing TensorFlow).
3- The aim is to crop a large image to obtain a batch of 400 images, each of size 256x256x3, and predict with the previous model
4- The time taken just to crop and convert the cv::Mat to a cppflow::tensor is about 800 milliseconds.
c++: cv::Mat flat = src.reshape(1, src.total() * src.channels());
std::vector<float> img_data = src.isContinuous() ? flat : flat.clone();
const cppflow::tensor input(img_data, { 400, 256, 256, 3 });
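If most of the 800 ms goes into per-crop copies and the intermediate cv::Mat reshape, one option is to write each 256x256x3 tile directly into a single preallocated float buffer laid out as {N, H, W, C} and build the tensor once from that. Here is a minimal sketch of that tiling loop in plain C++ (cv::Mat and cppflow are deliberately left out so the indexing logic stands alone; `src` is assumed to be an interleaved HWC float buffer):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy a grid of tileH x tileW x channels crops from one large interleaved
// image into a single contiguous batch buffer with layout {N, tileH, tileW, C}.
std::vector<float> make_batch(const std::vector<float>& src,
                              std::size_t srcW, std::size_t srcH,
                              std::size_t tileW, std::size_t tileH,
                              std::size_t channels) {
    const std::size_t cols = srcW / tileW;   // tiles per image row
    const std::size_t rows = srcH / tileH;   // tiles per image column
    std::vector<float> batch(rows * cols * tileH * tileW * channels);
    float* dst = batch.data();
    for (std::size_t ty = 0; ty < rows; ++ty)
        for (std::size_t tx = 0; tx < cols; ++tx)
            for (std::size_t y = 0; y < tileH; ++y) {
                // one contiguous row copy per tile line
                const float* row = src.data()
                    + ((ty * tileH + y) * srcW + tx * tileW) * channels;
                dst = std::copy(row, row + tileW * channels, dst);
            }
    return batch;
}
```

With a 5120x5120x3 image this produces the 400x256x256x3 batch in a single pass over the source, and the resulting vector can be passed straight to the cppflow::tensor constructor, skipping the separate crop and reshape copies.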
5- The inference time is around 1300 milliseconds.
c++: auto output = (*model)(input);
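One thing worth ruling out: the first call into a freshly loaded SavedModel includes graph building and cuDNN algorithm selection, so it is much slower than steady-state calls and should be timed separately. A small warm-up/timing harness like the one below can separate the two (the actual inference call is passed in as a callable; substitute a lambda wrapping `(*model)(input)`):

```cpp
#include <chrono>
#include <functional>

// Run one untimed warm-up call (graph build + autotuning), then return the
// average wall-clock time in milliseconds over `iters` steady-state calls.
double time_inference(const std::function<void()>& run_once, int iters) {
    using clock = std::chrono::steady_clock;
    run_once();                                  // warm-up, not timed
    const auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) run_once();  // steady-state calls
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```

If the steady-state time is far below 1300 ms, the cost you are seeing is largely one-time initialisation rather than per-batch inference.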

I have the impression that the images in the batch are not really processed in parallel, because the inference time varies considerably with the batch size.

I use CUDA 11.2 and cuDNN 8.101.