I have split an ONNX model into multiple sub ONNX model files (using the onnx.utils.extract_model API). For every sub ONNX model I use trtexec to generate a TRT engine, so in the end I have all the TRT engines.
Then I load and deserialize all the engines in my project and use the createExecutionContext API to create multiple contexts (a simplified sketch of this setup follows the list below). The problem is that every engine's createExecutionContext call uses 16 MB of GPU memory, so after all contexts are created, N * 16 MB of memory is used.
I am confused about:
1. What does TensorRT do when an engine's createExecutionContext is called?
2. Why does every context hold 16 MB of memory?
3. Is there a way to share a context (or its memory) between all engines?
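Roughly, the loading code looks like the sketch below (a simplified example, not the exact project code; the engine file names and the logger are placeholders). It also prints getDeviceMemorySize() for each engine, which is the activation memory a plain createExecutionContext() would allocate:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kINFO) std::cout << msg << std::endl;
    }
};

static std::vector<char> readFile(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    const size_t size = static_cast<size_t>(f.tellg());
    std::vector<char> buf(size);
    f.seekg(0);
    f.read(buf.data(), size);
    return buf;
}

int main() {
    Logger logger;
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);

    // Placeholder names for the per-operator engines built with trtexec.
    std::vector<std::string> enginePaths = {"conv_4.trt", "relu_5.trt"};

    std::vector<nvinfer1::ICudaEngine*> engines;
    std::vector<nvinfer1::IExecutionContext*> contexts;

    for (const std::string& path : enginePaths) {
        std::vector<char> blob = readFile(path);
        nvinfer1::ICudaEngine* engine =
            runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
        engines.push_back(engine);

        // Activation (scratch) memory this engine's context will need.
        std::cout << path << ": device memory = "
                  << engine->getDeviceMemorySize() << " bytes" << std::endl;

        // Each of these calls allocates its own persistent + activation memory.
        contexts.push_back(engine->createExecutionContext());
    }

    // ... enqueue inference on the contexts, then clean up ...
    for (nvinfer1::IExecutionContext* c : contexts) c->destroy();
    for (nvinfer1::ICudaEngine* e : engines) e->destroy();
    runtime->destroy();
    return 0;
}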
Environment
TensorRT Version: 7.2.2.3
GPU Type: V100, 2080 Ti
Hi,
Request you to share the ONNX model and the script if not shared already so that we can assist you better.
In the meantime, you can try a few things:
1) Validate your model with the snippet below:
check_model.py
import onnx

filename = "yourONNXmodel.onnx"  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)
2) Try running your model with the trtexec command.
In case you are still facing the issue, please share the trtexec "--verbose" log for further debugging.
Thanks!
Thank you for your reply. I have used TensorRT for many years and am familiar with it. Now I want to integrate TensorRT into our deep learning framework (similar to ONNX Runtime). Our plan is to split the ONNX model into separate operators: operators that TensorRT supports will run on TensorRT, and operators it does not support will run on our native kernels.
In this plan, every operator that TensorRT supports gets its own INetworkDefinition, ICudaEngine and IExecutionContext.
So when the model runs there will be many ICudaEngines and IExecutionContexts, which use more GPU memory.
I wonder why creating an IExecutionContext uses 16 MB of memory, and whether it is possible for all engines to share a context (or at least its memory) to reduce memory use.
Each execution context may hold some persistent and activation (scratch) memory. You can find more information here: Developer Guide :: NVIDIA Deep Learning TensorRT Documentation
During engine building, the log will also tell you how much activation memory the context will use, for example:
[06/20/2022-03:15:10] [I] [TRT] Total Activation Memory: 8388608
It depends on whether it's activation memory or persistent memory. First, please go through the user guide section on memory and diagnose which of these is the case. If it's activation memory, you can share it between contexts using createExecutionContextWithoutDeviceMemory(). If it's persistent memory, the problem is most likely edge mask tactics, and you can turn those off using setTacticSources().
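As a rough sketch of the activation-memory case (not tested against your setup): size one scratch buffer for the largest getDeviceMemorySize() among the engines, create each context with createExecutionContextWithoutDeviceMemory(), and hand it the shared buffer via setDeviceMemory(). The contexts must then be executed sequentially, never concurrently, because they share the same scratch space.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <algorithm>
#include <vector>

// Create one context per engine, all borrowing the same activation-memory pool.
std::vector<nvinfer1::IExecutionContext*>
createContextsWithSharedScratch(const std::vector<nvinfer1::ICudaEngine*>& engines)
{
    // One buffer sized for the largest activation requirement among all engines.
    size_t maxScratch = 0;
    for (const nvinfer1::ICudaEngine* e : engines)
        maxScratch = std::max(maxScratch, e->getDeviceMemorySize());

    void* scratch = nullptr;
    cudaMalloc(&scratch, maxScratch);  // must stay alive as long as the contexts do

    std::vector<nvinfer1::IExecutionContext*> contexts;
    for (nvinfer1::ICudaEngine* e : engines) {
        // No activation memory is allocated by this call ...
        nvinfer1::IExecutionContext* ctx = e->createExecutionContextWithoutDeviceMemory();
        // ... the context uses the shared pool instead.
        ctx->setDeviceMemory(scratch);
        contexts.push_back(ctx);
    }
    return contexts;
}

Note that this only removes per-context activation memory; any persistent memory an engine needs is still allocated per context.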
I have printed the TRT 7 logs when creating the context, stepping through with gdb while watching nvidia-smi.
335 mEngine = mRuntime->deserializeCudaEngine(trt_model_stream, size, nullptr);
(gdb) n
2022-06-27 20:03:48 | log | INFO: Deserialize required 102589 microseconds.
336 mContext = mEngine->createExecutionContext();
(gdb) n
2022-06-27 20:03:58 | log | INFO: Allocated persistent device memory of size 901120
2022-06-27 20:03:58 | log | INFO: Allocated activation device memory of size 0
2022-06-27 20:03:58 | log | INFO: Assigning persistent memory blocks for various profiles
338 int engine_size = mEngine->getDeviceMemorySize();
(gdb) c
2022-06-27 20:13:04 | [INFO ] | name: Conv_4, engine_size: 0
Before the create-context line below executes, GPU memory usage is 211 MB; afterwards it is 228 MB, so creating the context uses about 17 MB.
mContext = mEngine->createExecutionContext();
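For reference, the same delta can be measured directly around the call with cudaMemGetInfo instead of nvidia-smi (a small sketch, not part of the project code):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <iostream>

// Create a context and report how much device memory the call itself consumed.
nvinfer1::IExecutionContext* createContextAndReport(nvinfer1::ICudaEngine* engine)
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaMemGetInfo(&freeAfter, &total);
    std::cout << "createExecutionContext consumed "
              << (freeBefore - freeAfter) / (1024.0 * 1024.0) << " MiB" << std::endl;
    return ctx;
}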