Provide details on the platforms you are using:
Linux distro and version: Ubuntu 18.04
GPU type: NVIDIA Tesla T4
NVIDIA driver version: 410.79
CUDA version: 10.0
CUDNN version: 7.4
Python version [if using python]: 3.6.7
Tensorflow version: 1.13.0.dev20181218 (tf-nightly-gpu installed on 12/18/2018)
TensorRT version: 5.0.2.6
If Jetson, OS, hw versions: N/A (not a Jetson platform)
Describe the problem
We have a use case where we need to run multiple GPU-accelerated TF sessions in parallel; each session runs inference with a fixed graph (possibly different across sessions) on a stream of batched inputs. The TF sessions are created with gpu_options.allow_growth=True and gpu_options.per_process_gpu_memory_fraction set appropriately for the number of concurrent sessions.
We use TensorRT to optimize/calibrate the TF graph offline, and then import the optimized graph to set up the TF sessions described above. When we convert the graph with the 'fp32' or 'fp16' precision mode, the conversion produces a static engine for the TRTEngineOp nodes and everything works as expected.
However, when we create 'int8'-precision graphs, the conversion process creates a dynamic engine, which gets instantiated the first time the graph is run in the concurrent setup above. We are finding that this lazy engine creation for int8 consumes nearly all of the GPU memory (14+ GB on our 16 GB T4), causing GPU memory allocation failures if we try to run more than one session in parallel. Passing max_workspace_size_bytes and session_config to the trt.create_inference_graph(…) call has no impact on the memory used for engine creation.
We are hoping someone on the forum can suggest a fix or workaround for this issue, some combination of: (1) how we might limit GPU memory usage during the first-call instantiation of the dynamic engine; (2) whether there are configs we should be passing to the create_inference_graph or calib_graph_to_infer_graph invocations; or (3) whether there is a way to persist the constructed static engine so it can be loaded later for concurrent use.