I am comparing the inference time of a Keras model to the same model optimized with TensorRT 5, but the speedup from TensorRT is only a factor of about 1.2.
I’m using the Python API of TensorRT 5 on an AWS p3.2xlarge instance (Tesla V100 GPU) with the Ubuntu Deep Learning Base AMI. My model is similar to DnCNN (GitHub - husqin/DnCNN-keras), but without the residual part and in NCHW data format.
Using 420 images of 512 by 512 pixels and a batch size of 16, I get the following results:
Keras: 14.97 seconds of total inference time
TRT FP32: 12.3 seconds of total inference time
Is this the expected speedup, or is something wrong?
Thanks for providing the model/data offline. I’m getting the following when running TRT (trt_or_keras = 0):
root@93787c5a2aac:/home/nvidia/zhen/reproduce.2405969# python trt_helpers.py
Using TensorFlow backend.
2018-10-22 16:55:00.598030: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
=== Automatically deduced input nodes ===
[name: "input_1"
op: "Placeholder"
attr {
key: "dtype"
value {
type: DT_FLOAT
}
}
attr {
key: "shape"
value {
shape {
dim {
size: -1
}
dim {
size: 1
}
dim {
size: -1
}
dim {
size: -1
}
}
}
}
]
=========================================
=== Automatically deduced output nodes ===
[name: "conv2d_17/add"
op: "Add"
input: "conv2d_17/transpose_1"
input: "conv2d_17/Reshape"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]
==========================================
Using output node conv2d_17/add
Converting to UFF graph
No. nodes: 454
TRT prediction time: 13.274264097213745
The Keras inference (trt_or_keras = 1) is much, much slower. I don’t think it’s using the GPU; see the device check sketched after the log below.
root@93787c5a2aac:/home/nvidia/zhen/reproduce.2405969# python trt_helpers.py
Using TensorFlow backend.
2018-10-22 17:00:10.971094: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
80/420 [====>.........................] - ETA: 17:21
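As a quick sanity check (not part of the original repro), TensorFlow can be asked to list the devices it sees; if no entry like "/device:GPU:0" appears, Keras is indeed running on the CPU:

from tensorflow.python.client import device_lib

# List every device this TensorFlow build can use; a working GPU setup
# should report a device with device_type "GPU" for the Tesla V100.
print(device_lib.list_local_devices())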
I apologize for the delay. Our engineers have been looking at the repro and here’s the feedback:
The Python code is doing a lot of extra work during inference. Only the TRT execution context time should be recorded.
This is the relevant part:
# Create Execution Context
with self._engine.create_execution_context() as context:
    # This is generalized for multiple inputs/outputs.
    # inputs and outputs are expected to be lists of HostDeviceMem objects.
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=batch_size)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
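One way that narrower timing could look (a sketch, reusing the inputs/outputs/bindings/stream names from the snippet above; the stream.synchronize() before stopping the clock is needed because execute_async returns before the kernels finish):

import time

# Stage input copies outside the timed region.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
stream.synchronize()

start = time.time()
context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=batch_size)
stream.synchronize()  # execute_async is asynchronous; wait before stopping the clock
trt_time = time.time() - start

# Copy predictions back after timing.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()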
It looks like there are multiple calls to a TRT engine built with a max batch size of 16. I guess you could just build the engine for a bigger batch size to get better “end-to-end” perf, along the lines of the sketch below.
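A minimal sketch of such a rebuild using the TensorRT 5 UFF path (model.uff and the max batch size of 64 are placeholders; the node names and CHW input shape come from the log above):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    builder.max_batch_size = 64           # build for more than 16 samples per call
    builder.max_workspace_size = 1 << 30  # 1 GiB of scratch space for tactic selection
    parser.register_input("input_1", (1, 512, 512))  # CHW shape of one grayscale image
    parser.register_output("conv2d_17/add")
    parser.parse("model.uff", network)    # placeholder path to the converted model
    engine = builder.build_cuda_engine(network)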
We recommend correcting the benchmarking code and implementing an efficient way to manage data input/output with TRT, e.g. preallocated page-locked buffers as sketched below. This doesn’t seem to be a TRT defect or performance issue.
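For completeness, the HostDeviceMem objects the snippet above expects can be allocated once with page-locked host memory and reused across batches, along the lines of the TensorRT Python samples (allocate_buffers here is a sketch, not the customer’s code):

import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem(object):
    """Pairs a page-locked host buffer with its device counterpart."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:  # iterates over binding names
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Page-locked host memory keeps the async H2D/D2H copies truly asynchronous.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream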