TensorFlow/TRT with multiple TF sessions - Dynamic INT8 engine memory allocation errors

Provide details on the platforms you are using:
Linux distro and version: Ubuntu 18.04
GPU type: NVIDIA Tesla T4
nvidia driver version: 410.79
CUDA version: 10.0
CUDNN version: 7.4
Python version [if using python]: 3.6.7
Tensorflow version: 1.13.0.dev20181218 (tf-nightly-gpu installed on 12/18/2018)
TensorRT version:
If Jetson, OS, hw versions

Describe the problem
We have a use-case where we need to run multiple GPU-backed TF sessions in parallel - each session runs inference with a fixed graph (possibly different across sessions) on a stream of batched inputs. The TF sessions are created with gpu_options.allow_growth=True and gpu_options.per_process_gpu_memory_fraction set appropriately for the number of concurrent sessions.
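For concreteness, a minimal sketch of the per-session setup described above, using the TF 1.x API; the 0.45 memory fraction is illustrative (for two concurrent sessions), not a value from our actual configuration:

```python
import tensorflow as tf

# Per-session GPU options: grow allocations on demand, and cap the
# fraction of total GPU memory each concurrent session may claim.
gpu_options = tf.GPUOptions(
    allow_growth=True,
    per_process_gpu_memory_fraction=0.45)  # illustrative for 2 sessions
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... import the (TRT-optimized) frozen graph and run inference ...
    pass
```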

We use TensorRT to optimize/calibrate the TF graph offline, and then import the optimized graph to set up the TF sessions described above. When we convert the graph using 'fp32' or 'fp16' precision modes, the conversion produces a static engine for the TRTEngineOp nodes and everything works as expected.

However, when we try creating 'int8' precision-mode graphs, the conversion process creates a dynamic engine, which gets instantiated the first time the graph is used in the concurrent setup above. We are finding that this lazy engine creation for int8 requires essentially all of the GPU memory (14+ GB on our T4), resulting in GPU memory allocation failures if we try running more than one session in parallel. Passing the max_workspace_size_bytes and session_config parameters to the trt.create_inference_graph(…) call has no impact on the memory used for engine creation.
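For reference, a hedged sketch of our offline conversion call using the TF 1.13 contrib API; the output node name, batch size, and workspace cap are illustrative, not our exact values:

```python
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# frozen_graph_def: a frozen tf.GraphDef loaded from disk (not shown).
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=['logits'],                 # hypothetical output node name
    max_batch_size=32,                  # illustrative
    max_workspace_size_bytes=1 << 30,   # 1 GB cap; has no observed effect
    precision_mode='INT8')
```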

We were hoping someone on the forum could suggest a fix or workaround for this issue - some combination of: (1) how we might limit GPU memory usage during first-call instantiation of the dynamic engine; (2) whether there are appropriate configs we should be passing to the create_inference_graph or calib_graph_to_infer_graph invocations; or (3) whether there is a way to persist the constructed static engine so it can be loaded later for concurrent use.
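For context, a hedged sketch of the INT8 calibration flow we follow with the TF 1.13 contrib API; `calib_graph` is the GraphDef returned by create_inference_graph(precision_mode='INT8'), and the feed/fetch tensor names and the calibration-batch iterable are hypothetical:

```python
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# 1) Run representative batches through the calibration graph so TRT can
#    collect activation ranges.
with tf.Graph().as_default():
    tf.import_graph_def(calib_graph, name='')
    with tf.Session() as sess:
        for batch in calibration_batches:   # hypothetical iterable
            sess.run('logits:0', feed_dict={'input:0': batch})

# 2) Convert the calibrated graph into the final inference graph.
infer_graph = trt.calib_graph_to_infer_graph(calib_graph)
```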


  1. Workspace size should limit the memory available to the builder. Is this not happening? If so, can you share a small repro that demonstrates this behavior?

  2. You may write out a frozen protobuf after calling create_inference_graph().
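One way to persist the converted GraphDef as a frozen protobuf, per the suggestion above; `trt_graph` is the GraphDef returned by create_inference_graph(), and the file path is illustrative:

```python
import tensorflow as tf

# Serialize the converted GraphDef to disk.
with tf.gfile.GFile('/tmp/trt_int8_frozen.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())

# Later, each session process can reload it before importing.
graph_def = tf.GraphDef()
with tf.gfile.GFile('/tmp/trt_int8_frozen.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
```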

Workspace size limits the memory used by the builder during the create_inference_graph() call, but does not appear to limit the memory used when the engine is instantiated from the dynamic int8 TRTEngineOp node (which, from the code, stores the serialized graph_def and the calibration info). Here, we are observing >8 GB of GPU memory being used.

Will prepare a standalone reproduction of the problem and share.

Re your response to my question (3): my understanding is that the written-out graph would still contain only the dynamic version of the TRT op … converting it to a static engine would hit the same problem.

Attached .zip file contains python programs/scripts to reproduce the issue.

To recap, the issue being highlighted is that the dynamic engine creation phase for int8 TRT graphs requires a large amount of GPU memory, which prevents instantiation of multiple TF/CUDA contexts on the GPU even though there is significant headroom available after engine creation.

The attached PDF shows this problem by contrasting execution-time GPU memory utilization of fp16 and int8 TRT graphs for the same underlying Tensorflow graph.
int8-engine-issue.zip (264 KB)
gpu-utilization.pdf (161 KB)


Currently, this is a TRT limitation/design decision that does not allow the user to restrict memory usage while building the engine. We are working on a solution.

Background info:

  1. TRT is designed to use ALL available GPU memory during the build stage. Thus, a TF-TRT option to control GPU memory usage during build would not have any effect on build memory usage.
  2. Setting the workspace size in TRT via the setMaxWorkspaceSize(X) API only guarantees that the generated engine will use at most X bytes of workspace memory at runtime.

This is a major request, and we’ll be working on adding this feature in a future release. No schedule available yet.