"Engine buffer is full"

Hello,

I am using the TensorRT-TensorFlow integration (sub-graph optimization) with TensorRT 3.0.4 and TensorFlow 1.10 on the TensorFlow Object Detection API, and it produces correct results (bounding boxes). However, I have some questions about the TensorRT log output and behavior:

2018-08-23 11:35:12.094128: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.

This happens when executing create_inference_graph(). What does this mean and is it relevant?

2018-08-23 11:35:33.590131: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300

This happens every time I run the session. Is it an out-of-memory error, or what else causes it? It is immediately followed by this warning:

2018-08-23 11:35:33.590161: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0

Does this mean that the TensorRT execution failed and TensorFlow is now running the segment natively instead?

I also noticed (possibly related to the above) that TensorRT does not seem to allocate any GPU memory when running the session, but does so when creating the inference graph. Is that expected behavior?

Any help is greatly appreciated, thanks in advance!

Hello, can you provide details on the platforms you are using?

Linux distro and version
GPU type
nvidia driver version
CUDA version
CUDNN version
Python version [if using python]
Tensorflow version
TensorRT version

Linux distro and version: CentOS 7.5
GPU type: Quadro M2200
nvidia driver version: 390.77
CUDA version: 9.0
CUDNN version: 7.0.5
Python version [if using python]: 2.7
Tensorflow version: 1.10 (master, built from source, 22 August 2018)
TensorRT version: 3.0.4

Thank you. Will keep you updated on what we find.

Yes, it looks like it tried to use TRT but didn’t succeed. Can you post the full log?

Sure, here you go:

2018-08-30 14:47:31.126811: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2018-08-30 14:47:31.128856: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2018-08-30 14:47:31.129278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0
2018-08-30 14:47:31.129300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-30 14:47:31.129310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972]      0 
2018-08-30 14:47:31.129318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0:   N 
2018-08-30 14:47:31.129455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2828 MB memory) -> physical GPU (device: 0, name: Quadro M2200, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-08-30 14:47:34.184799: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2951] Segment @scope '', converted to graph
2018-08-30 14:47:34.184834: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2018-08-30 14:47:39.693597: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 85 nodes succeeded.
2018-08-30 14:47:42.076072: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-08-30 14:47:42.224466: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-08-30 14:47:42.257691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:403] Optimization results for grappler item: tf_graph
2018-08-30 14:47:42.257725: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   constant folding: Graph size after: 11435 nodes (-495), 15635 edges (-524), time = 1057.04102ms.
2018-08-30 14:47:42.257735: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   layout: Graph size after: 11450 nodes (15), 15653 edges (18), time = 308.294ms.
2018-08-30 14:47:42.257743: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   TensorRTOptimizer: Graph size after: 11366 nodes (-84), 15566 edges (-87), time = 6930.48096ms.
2018-08-30 14:47:42.257751: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   constant folding: Graph size after: 11363 nodes (-3), 15564 edges (-2), time = 613.757ms.
2018-08-30 14:47:42.257764: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   TensorRTOptimizer: Graph size after: 11363 nodes (0), 15564 edges (0), time = 1430.89197ms.
2018-08-30 14:47:42.257775: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:403] Optimization results for grappler item: my_trt_op_0_native_segment
2018-08-30 14:47:42.257787: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   constant folding: Graph size after: 86 nodes (0), 88 edges (0), time = 121.162ms.
2018-08-30 14:47:42.257795: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   layout: Invalid argument: The graph is already optimized by layout optimizer.
2018-08-30 14:47:42.257806: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   TensorRTOptimizer: Graph size after: 86 nodes (0), 88 edges (0), time = 10.492ms.
2018-08-30 14:47:42.257816: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   constant folding: Graph size after: 86 nodes (0), 88 edges (0), time = 137.719ms.
2018-08-30 14:47:42.257825: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:405]   TensorRTOptimizer: Graph size after: 86 nodes (0), 88 edges (0), time = 12.276ms.
2018-08-30 14:48:37.387138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0
2018-08-30 14:48:37.387308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-30 14:48:37.387381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972]      0 
2018-08-30 14:48:37.387470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0:   N 
2018-08-30 14:48:37.387961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2828 MB memory) -> physical GPU (device: 0, name: Quadro M2200, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-08-30 14:48:55.002149: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:55.002175: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:56.213187: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-08-30 14:48:56.332753: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.58GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-08-30 14:48:56.609630: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-08-30 14:48:57.041135: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:57.041211: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:57.453425: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:57.453487: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:57.858409: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:57.858510: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:58.282362: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:58.282445: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:58.696690: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:58.696783: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:59.120853: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:59.120941: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:59.527978: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:59.528044: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:48:59.959494: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:48:59.959591: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:49:00.388077: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:49:00.388151: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:49:00.820113: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:49:00.820202: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0
2018-08-30 14:49:01.253252: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=300
2018-08-30 14:49:01.253369: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_0

Hello,

Can you please share with us a script/code/dataset sample to reproduce this? It would help us debug. If you’d like, you can direct message me.

Hello,

Your conversion batch size is 1 (max_batch_size=1), while your execution batch size is 300 (requested batch=300).

Please set max_batch_size=300 during conversion and try again.
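
For reference, a minimal conversion sketch with the larger batch size (a sketch only, not the exact code from this thread; frozen_graph and output_names stand in for your own frozen GraphDef and list of output node names):

import tensorflow.contrib.tensorrt as trt

# Rebuild the TRT-optimized graph so the engine accepts batches up to 300.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,       # placeholder: your frozen GraphDef
    outputs=output_names,               # placeholder: output node names
    max_batch_size=300,                 # must cover the largest batch requested at runtime
    max_workspace_size_bytes=1 << 30,   # roughly 1 GiB of TensorRT workspace
    precision_mode='FP32')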

@NVES I am having the same problem on Jetson Xavier. I converted my frozen weights to trt using this code:

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Convert the frozen graph; the resulting engines only accept batches
# up to max_batch_size.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=4000000000,
    precision_mode='FP16',
    minimum_segment_size=50)

# Let TensorFlow grow GPU memory on demand instead of grabbing it all up front.
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True

tf_sess = tf.Session(config=tf_config)

tf.import_graph_def(trt_graph, name='')

Then, when I call session.run(), I get this:

2018-12-06 12:42:18.240586: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=100
2018-12-06 12:42:18.240680: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_1

I believe you are running Faster/Mask R-CNN. The problem is that the graph abuses the batch dimension in the last stage of the algorithm to compute outputs for the 100 boxes coming from the RPN. To check whether I am right, change the parameter in your protobuf config that is currently set to 100 to 99 and see if the requested batch changes. I don’t know a simple way around this; object detection models, and Faster R-CNN in particular, seem to be hard to convert to TRT.

I sure am using Faster R-CNN! So I should change that value in the .config to 99 and leave max_batch_size at 1? I’ll let you know how it goes.

@Klamue I changed the following line in the config:

    first_stage_max_proposals: 99 #From 100

And I get the same warning as before, just with requested batch=99:

2018-12-07 14:35:31.403165: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:260] Engine buffer is full. buffer limit=1, current entries=1, requested batch=99
2018-12-07 14:35:31.403291: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:277] Failed to get engine batch, running native segment for my_trt_op_1

Any ideas?

There are also two other settings in the config set to 100 under second_stage_post_processing: max_detections_per_class and max_total_detections. Should I set those to 99 too?

One thing I noticed is that TensorRT seems to be creating two segments: my_trt_op_0_native_segment and my_trt_op_1_native_segment. Is it possible that these correspond to the RPN and the classification stage, respectively? If so, the problem would be with the classification stage rather than the RPN, since my_trt_op_1 is the one falling back to its native segment.
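
For what it’s worth, here is a small sketch (assuming the GraphDef returned by create_inference_graph is held in a variable called trt_graph) that lists the engine ops and the size of each native-segment fallback function, which might help map my_trt_op_0 and my_trt_op_1 to parts of the model:

# Engine ops that TensorRT inserted into the converted graph.
for node in trt_graph.node:
    if node.op == 'TRTEngineOp':
        print(node.name)

# Each engine keeps a native TensorFlow fallback function in the graph's
# function library; the node count hints at which sub-graph it covers.
for func in trt_graph.library.function:
    print(func.signature.name + ': ' + str(len(func.node_def)) + ' nodes')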

If you think it would help, I can post the whole config file

Hi @atyshka, I am also running into the same problem. Any updates? Have you solved it? Thanks!

@yichunchen, unfortunately I have not. This seems to be a known issue with TensorFlow and Faster R-CNN. Let’s hope it’s resolved in TF 2.0.

I fixed this problem by setting max_batch_size to num_images x 300; for instance, if you’re going to process 8 images at a time, set it to 2400.

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

num_images = 8
trt_graph = trt.create_inference_graph(
    input_graph_def=tf.get_default_graph().as_graph_def(),
    outputs=output_node,
    max_batch_size=num_images * 300,   # 300 proposals per image
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',             # TRT engine precision: "FP32", "FP16" or "INT8"
    minimum_segment_size=50)           # minimum number of nodes in an engine

On a V100, the performance gain with FP16 was about 20%; I’m going to try INT8 next.
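
In case it helps anyone going the same way, this is a rough sketch of the INT8 flow in the TF 1.x contrib API (a sketch only; frozen_graph, output_node, num_images, input_name, and calibration_batches are placeholders, and the exact calibration steps may differ between versions):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# 1. Build a calibration graph with INT8 precision (same arguments as before).
calib_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_node,
    max_batch_size=num_images * 300,
    max_workspace_size_bytes=1 << 25,
    precision_mode='INT8',
    minimum_segment_size=50)

# 2. Run the calibration graph on representative batches so TensorRT
#    can collect activation statistics.
with tf.Graph().as_default():
    tf.import_graph_def(calib_graph, name='')
    with tf.Session() as sess:
        fetches = [name + ':0' for name in output_node]
        for batch in calibration_batches:              # placeholder: real input data
            sess.run(fetches, feed_dict={input_name + ':0': batch})

# 3. Replace the calibration nodes with the final INT8 engines.
int8_graph = trt.calib_graph_to_infer_graph(calib_graph)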