Multi-model inference - swapping GPU memory

Hi, I’ve converted a few CNN models via TF-TRT (TRT 7.2.2.3, TF 2.4.1, CUDA 11.1, Python API), but I’m limited by GPU (2080 Ti) memory rather than throughput. Is there a way to swap/copy a model’s memory from the GPU out to fast DDR (low-latency copy) or to the CPU (high-latency copy, less preferred) and thereby release GPU memory for other models? The model would of course be copied back when needed, in a sort of multi-tasking fashion.
Thanks,
Hanoch

Hi, could you please share your model and script, so that we can help you better?

Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

Thanks!

Hi, this is a general question about a generic CNN model. I can share code if necessary!

Hi @hanoch.kremer,

In TF-TRT we do not have fine-grained control over memory. The GPU memory is freed when the model is deleted and garbage collected, but loading the model every time you need to infer and deleting it afterwards is probably a bad idea; it would be detrimental to performance.
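For illustration, here is a minimal sketch of that delete-and-garbage-collect pattern (the SavedModel directory name is a placeholder); after collection, the memory the model held becomes reusable within the process:

```python
import gc
import tensorflow as tf

# Load a TF-TRT converted SavedModel (the directory name is hypothetical).
model = tf.saved_model.load("converted_savedmodel_dir")
infer = model.signatures["serving_default"]
# ... run inference with `infer` ...

# Drop all references so the model's GPU memory can be reused by another model.
del infer, model
gc.collect()
```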
If the whole model is converted to a single TRT engine, then one can save memory by using the TRT engine only (a converted model stores two copies of the weights, one for TF and one for TRT). Loading the plan file is not yet documented for TF2, but we have an example here.
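As an illustration only (this is not the notebook’s approach), a serialized plan file can also be deserialized with the standalone TensorRT Python runtime, so that only the TRT copy of the weights lives in GPU memory; the file name below is an assumption:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize a previously serialized engine ("model.plan" is a placeholder name).
with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context for inference; input/output device buffers still
# need to be allocated and bound before executing the engine.
context = engine.create_execution_context()
```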

Thank you.

Hi, thanks for the info. Would the answer change if the model is pure TF 2.4 (no TRT)? In that case, what is the best practice for sharing limited GPU memory among a few inference CNN models? What do you mean by “save memory”? Please allow access to the file, as Google asks for permission. Thanks. Hanoch

Hi @hanoch.kremer,

I have updated the access permissions for the notebook. This example is only applicable to models that are completely converted to TF-TRT.

Thank you.

Hi, I still can’t access the notebook due to permission issues.

Hi @hanoch.kremer,

I have edited the previous reply with a new link; you might still be referring to the old one.
For your reference, I am sharing the new link again.

Thank you.

No, it is OK now, thanks.

Hi, in case the model is pure TF 2.4 (no TF-TRT), can you recommend a way to manage swapping a few inference models in and out so they can share the limited GPU memory? Assume fast DDR is available as a resource, and something like a tiny OS manages it. I’ve tried NVIDIA MPS but it didn’t perform well. Thanks, Hanoch

Hi @hanoch.kremer,

We could probably move the model to the GPU on demand by walking through the graph and moving the weights to the GPU; I am not sure about the details. Before we invest a lot of effort, please note that the final performance will be poor, because moving data between the CPU and GPU is slow and incurs a large latency.
If your end goal is low latency, the swapping idea is probably a no-go.
If your goal is high throughput, then estimate the overhead based on model size and PCIe bandwidth, and compare it to how much faster you are on the GPU (see the back-of-envelope example below).
If it looks promising, then we can try to implement it (a rough sketch of such swapping is shown after the estimate).
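
For example, a back-of-envelope estimate of the swap overhead; the numbers are assumptions, so substitute your own model size and measured PCIe bandwidth:

```python
# All numbers below are illustrative assumptions.
model_size_gb = 0.1          # ~100 MB of weights for a typical CNN
pcie_bandwidth_gb_s = 12.0   # rough effective PCIe 3.0 x16 bandwidth
swap_overhead_s = 2 * model_size_gb / pcie_bandwidth_gb_s  # copy out + copy back
print(f"Swap overhead per model switch: {swap_overhead_s * 1e3:.1f} ms")  # ~17 ms
```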
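And a rough sketch of how such swapping could look for a plain TF 2.x Keras model, keeping the weights in host RAM and rebuilding the model on the GPU on demand; `build_model` is a hypothetical placeholder for whatever constructs your architecture:

```python
import gc
import tensorflow as tf

def swap_out(model):
    """Copy the weights to host RAM as numpy arrays; the caller should then
    drop its reference to `model` so its GPU memory can be reclaimed."""
    return model.get_weights()

def swap_in(build_model, weights):
    """Rebuild the architecture on the GPU and restore the weights
    (a host-to-device copy over PCIe)."""
    with tf.device("/GPU:0"):
        model = build_model()
    model.set_weights(weights)
    return model

# Usage sketch:
# weights_a = swap_out(model_a); del model_a; gc.collect()
# ... run model_b while model_a is swapped out ...
# model_a = swap_in(build_model_a, weights_a)
```

Whether this pays off depends entirely on the estimate above: if the per-switch overhead dominates the per-batch inference time, swapping will hurt throughput.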

Hope this will help you.

Thank you.