Inference on Very High Resolution Images

Description

I want to run an image enhancement network on a very high resolution image with tens of megapixels. I will definitely split the image into overlapping patches to fit into GPU memory. Which approach allows the higher throughput (frame rate)?

1 - Use a small patch size (128x128 or 256x256) and a large batch size (32 or 64)?
2 - Use a large patch size (1024x1024 or 2048x2048) and a small batch size (4 or 8)?

I am using TensorRT C++ library.

Environment

TensorRT Version: TensorRT-7.2.3.4
GPU Type: NVIDIA GeForce GTX 1660 Ti with Max-Q Design
Nvidia Driver Version: 27.21.14.6079
CUDA Version: 11
CUDNN Version:
Operating System + Version: Windows 10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,
Could you share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
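For example, you could benchmark both configurations with trtexec directly. In this sketch the model path `unet.onnx` and the input tensor name `input` are assumptions; adjust them to your model, which must have dynamic input dimensions for the shape flags to apply (otherwise export one ONNX per configuration):

```shell
# Option 1: small patches, large batch
trtexec --onnx=unet.onnx --fp16 --optShapes=input:32x3x256x256 --shapes=input:32x3x256x256

# Option 2: large patches, small batch
trtexec --onnx=unet.onnx --fp16 --optShapes=input:8x3x1024x1024 --shapes=input:8x3x1024x1024
```

Since the two options process different pixel counts per batch, compare throughput as pixels per second (reported qps multiplied by pixels per batch), which is the metric that matters for a fixed-size image.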

While measuring the model's performance, make sure you consider the latency and throughput of the network inference alone, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
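As a minimal sketch of that advice, the timing loop below measures only the inference portion after a few warm-up runs. `runInference` is a hypothetical stand-in for your TensorRT execution call (e.g. `context->executeV2(...)`); all pre- and post-processing must stay outside the timed region:

```cpp
#include <chrono>
#include <functional>

// Returns patches processed per second for the given inference call.
double measureThroughput(int batchSize, int iterations,
                         const std::function<void()>& runInference) {
    for (int i = 0; i < 5; ++i) runInference();  // warm-up: exclude lazy init
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) runInference();
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return batchSize * iterations / seconds;
}
```

If you enqueue work asynchronously, add a `cudaStreamSynchronize` on the stream before reading the stop time, so the measurement covers the completed GPU work.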

Thanks!

Thanks for your quick reply! The model is a standard U-Net like the one in this link.

Fundamentally, is choice 1 better or choice 2?
1 - Use a small patch size (128x128 or 256x256) and a large batch size (32 or 64)?
2 - Use a large patch size (1024x1024 or 2048x2048) and a small batch size (4 or 8)?

Hi @oelgendy1,

Larger patches with smaller batches may help, because larger inputs give the kernels more scope for larger (higher-efficiency) tiles.
That said, we would strongly suggest benchmarking both configurations and selecting whichever works best: without knowledge of the network and the device, it is not possible to answer the question definitively.

Thank you.
