Description
Hi, I’ve converted a few CNN models via TF-TRT (TRT 7.2.2.3; TF 2.4.1; CUDA 11.1; Python API) to run them in a pipeline. I’m limited by GPU memory (2080 Ti, 11 GB) rather than throughput. TF 2.4 preallocates memory per model and hence doesn’t leave “space” to load/release models at runtime. Is there an OS-like way to swap/copy a model’s memory from the GPU out to nearby host DDR (a slower, higher-latency copy) to release GPU memory for other models, and of course copy the model back when needed? I’ve tried NVIDIA MPS in EXCLUSIVE and DEFAULT mode, but it didn’t perform well and crashed when running 4x MobileNetV2 models (1.5 GB GPU memory each) plus a segmentation model (4.5 GB).
Thanks,
Hanoch
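For context, the preallocation described above can be reduced (though not released back to the driver at runtime) with TF 2.4's standard `tf.config` memory options — a sketch only; the 1536 MB cap is just the per-model footprint quoted above, not a recommendation:

```python
# Sketch: capping TensorFlow's per-process GPU memory grab so several
# models can coexist on one 11 GB card. Standard TF 2.4 APIs; note that
# TF's allocator still does not return freed memory to the driver, which
# is why a true swap-out/swap-in mechanism is being asked about.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: allocate on demand instead of preallocating the whole GPU.
    # Must be set before the GPU context is created.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (alternative to option 1): hard-cap this process to a fixed
    # slice of GPU memory, e.g. one model's ~1.5 GB footprint.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=1536)])
```

Running each model in its own capped process (rather than one process owning the whole GPU) at least bounds each model's footprint, but it does not provide the load/release behavior asked about here.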
Environment
python/TF 2.4
TensorRT Version: 7.2.2.3
GPU Type: 2080 Ti (11 GB)
Nvidia Driver Version:
CUDA Version: 11.1
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.x
TensorFlow Version (if applicable): 2.4.x
PyTorch Version (if applicable): 1.7
Baremetal or Container (if container which image + tag):
Relevant Files
Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Steps To Reproduce
Please include:
- Exact steps/commands to build your repro
- Exact steps/commands to run your repro
- Full traceback of errors encountered