Multi model inference - Swap GPU memory


Hi, I’ve converted a few CNN models via TF-TRT (TRT=; TF=2.4.1; CUDA=11.1, Python API) to run them in a pipeline. I’m limited by GPU memory (2080 Ti, 11 GB) rather than throughput. TF 2.4 preallocates memory per model and hence doesn’t leave “space” to load/release models in RT. Is there an OS-like way to swap/copy a model’s memory from the GPU out to nearby fast DDR (slow-latency copy) to release GPU memory for other models, and of course copy the model back when needed? I’ve tried NVIDIA MPS in EXCLUSIVE and DEFAULT modes, but it didn’t perform well and crashed when running 4x MobileNetV2 models (1.5 GB GPU memory each) plus a segmentation model (4.5 GB).
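To make the idea concrete, here is a minimal sketch of the kind of application-level swapping described above: a least-recently-used cache that keeps models resident up to a memory budget and evicts the oldest ones when a new model does not fit. Everything in it is an assumption for illustration: the class name `GpuModelCache`, the `load_fn`/`unload_fn` callables (stand-ins for whatever actually moves a model onto or off the GPU), and the 8 GB budget. It only does the bookkeeping; whether the framework really returns memory to the device on release is outside this sketch.

```python
from collections import OrderedDict


class GpuModelCache:
    """Bookkeeping-only LRU cache for models under a GPU memory budget."""

    def __init__(self, budget_gb, load_fn, unload_fn):
        self.budget_gb = budget_gb
        self.load_fn = load_fn          # hypothetical: name -> handle of a loaded model
        self.unload_fn = unload_fn      # hypothetical: releases a loaded model
        self.resident = OrderedDict()   # name -> (handle, size_gb), oldest first

    def used_gb(self):
        return sum(size for _, size in self.resident.values())

    def get(self, name, size_gb):
        # Already resident: mark as most recently used and return it.
        if name in self.resident:
            self.resident.move_to_end(name)
            return self.resident[name][0]
        if size_gb > self.budget_gb:
            raise MemoryError(f"{name} ({size_gb} GB) exceeds the budget")
        # Evict least recently used models until the new one fits.
        while self.used_gb() + size_gb > self.budget_gb:
            _, (handle, _) = self.resident.popitem(last=False)
            self.unload_fn(handle)
        handle = self.load_fn(name)
        self.resident[name] = (handle, size_gb)
        return handle


# Sizes taken from the post; the 8 GB budget is an assumed value.
evicted = []
cache = GpuModelCache(budget_gb=8.0,
                      load_fn=lambda name: name,   # stand-in loader
                      unload_fn=evicted.append)    # stand-in unloader
for i in range(4):
    cache.get(f"mobilenet_{i}", 1.5)
cache.get("segmentation", 4.5)

print(list(cache.resident))  # models still resident
print(evicted)               # models swapped out to make room
```

With these numbers, all five models together need 4 x 1.5 + 4.5 = 10.5 GB, so under the assumed 8 GB budget the two oldest MobileNetV2 instances are evicted before the segmentation model is loaded.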


python/TF 2.4
TensorRT Version:
GPU Type: 2080 Ti
Nvidia Driver Version:
CUDA Version: 11.1
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6x
TensorFlow Version (if applicable): 2.4x
PyTorch Version (if applicable): 1.7
Baremetal or Container (if container which image + tag):

Hi, please share your model and script so that we can help you better.

Alternatively, you can try running your model with the trtexec command.


Hi @hanoch.kremer,

It looks like this query has already been posted. Please respond in the thread below.

Thank you.

Hi, I tried to ask this on the same thread but was told that only TF-TRT questions are addressed there. What do you recommend I do?