I am comparing the inference time of a Keras model to the same model optimized with TensorRT 5, but the speedup from TensorRT is only a factor of about 1.2.
I’m using the Python API of TensorRT 5 on an AWS p3.2xlarge instance (Tesla V100 GPU) with the Ubuntu Deep Learning Base AMI. My model is similar to DnCNN (GitHub - husqin/DnCNN-keras), but without the residual part and in NCHW data format.
Using 420 images of 512 by 512 pixels and a batch size of 16, I get the following results:
Keras: 14.97 seconds of total inference time
TRT FP32: 12.3 seconds of total inference time
Is this the expected speedup, or is something wrong?
Thanks for providing the model/data offline. I’m getting the following when running TRT (trt_or_keras = 0):
root@93787c5a2aac:/home/nvidia/zhen/reproduce.2405969# python trt_helpers.py
Using TensorFlow backend.
2018-10-22 16:55:00.598030: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
=== Automatically deduced input nodes ===
[name: "input_1"
op: "Placeholder"
attr {
key: "dtype"
value {
type: DT_FLOAT
}
}
attr {
key: "shape"
value {
shape {
dim {
size: -1
}
dim {
size: 1
}
dim {
size: -1
}
dim {
size: -1
}
}
}
}
]
=========================================
=== Automatically deduced output nodes ===
[name: "conv2d_17/add"
op: "Add"
input: "conv2d_17/transpose_1"
input: "conv2d_17/Reshape"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]
==========================================
Using output node conv2d_17/add
Converting to UFF graph
No. nodes: 454
TRT prediction time: 13.274264097213745
The Keras inference (trt_or_keras = 1) is much, much slower. I don’t think it’s using the GPU; see the device check sketched after the log below.
root@93787c5a2aac:/home/nvidia/zhen/reproduce.2405969# python trt_helpers.py
Using TensorFlow backend.
2018-10-22 17:00:10.971094: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
80/420 [====>.........................] - ETA: 17:21
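As a quick sanity check (not part of the original repro), TensorFlow can be asked to list the devices it sees; if no entry like "/device:GPU:0" appears, Keras is indeed running on the CPU:

from tensorflow.python.client import device_lib

# List every device this TensorFlow build can use; a working GPU setup
# should report a device with device_type "GPU" for the Tesla V100.
print(device_lib.list_local_devices())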
I apologize for the delay. Our engineers have been looking at the repro and here’s the feedback:
The Python code is doing a lot of extra work during inference. Only the TRT execution context time should be recorded.
This is the relevant part:
# Create Execution Context
with self._engine.create_execution_context() as context:
    # This is generalized for multiple inputs/outputs.
    # inputs and outputs are expected to be lists of HostDeviceMem objects.
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=batch_size)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
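One way that narrower timing could look (a sketch, reusing the inputs/outputs/bindings/stream names from the snippet above; the stream.synchronize() before stopping the clock is needed because execute_async returns before the kernels finish):

import time

# Stage input copies outside the timed region.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
stream.synchronize()

start = time.time()
context.execute_async(bindings=bindings, stream_handle=stream.handle, batch_size=batch_size)
stream.synchronize()  # execute_async is asynchronous; wait before stopping the clock
trt_time = time.time() - start

# Copy predictions back after timing.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()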
It looks like there are multiple calls to a TRT engine built with a max batch size of 16. I guess you could just build the engine for a bigger batch size to get better “end-to-end” perf, along the lines of the sketch below.
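A minimal sketch of such a rebuild using the TensorRT 5 UFF path (model.uff and the max batch size of 64 are placeholders; the node names and CHW input shape come from the log above):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    builder.max_batch_size = 64           # build for more than 16 samples per call
    builder.max_workspace_size = 1 << 30  # 1 GiB of scratch space for tactic selection
    parser.register_input("input_1", (1, 512, 512))  # CHW shape of one grayscale image
    parser.register_output("conv2d_17/add")
    parser.parse("model.uff", network)    # placeholder path to the converted model
    engine = builder.build_cuda_engine(network)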
We recommend correcting the benchmarking code and implementing an efficient way to manage data input/output with TRT, e.g. preallocated page-locked buffers as sketched below. This doesn’t seem to be a TRT defect or performance issue.
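For completeness, the HostDeviceMem objects the snippet above expects can be allocated once with page-locked host memory and reused across batches, along the lines of the TensorRT Python samples (allocate_buffers here is a sketch, not the customer’s code):

import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem(object):
    """Pairs a page-locked host buffer with its device counterpart."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:  # iterates over binding names
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Page-locked host memory keeps the async H2D/D2H copies truly asynchronous.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream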