Unexpected TensorRT 5.1.2 Results vs TRTIS 1.0.0 Results

System information
What is the top-level directory of the model you are using: The image classification examples that use the C++ client API
OS Platform and Distribution: Linux Ubuntu 16.04
TRTIS type of installation: Docker image 19.03
TRTIS version: 1.0.0
TensorRT version: 5.1.2
CUDA/cuDNN version: 10.1
GPU model and memory: T4-16GB
Model used: resnet50-infer-5.uff

Describe the problem
Hi all, I am running ResNet-50 inference with TensorRT 5.1.2 and comparing the results against TRTIS 1.0.0 (which also uses TensorRT 5.1.2). However, the results are unexpected; the commands to reproduce and the generated logs are below:

Test 1 with TensorRT: Run the trtexec C++ sample with TensorRT 5.1.2 and generate the TRT plan file

root@5f234debee5c:/workspace/tensorrt/bin# ./trtexec --uff=/workspace/tensorrt/data/resnet50/resnet50-infer-5.uff --output=GPU_0/tower_0/Softmax --uffInput=input,3,224,224 --iterations=40 --int8 --batch=128 --device=0 --avgRuns=100 --saveEngine=resnet50-infer-5_int8_128_plan

Source code / logs:

[I] uff: /workspace/tensorrt/data/resnet50/resnet50-infer-5.uff
[I] output: GPU_0/tower_0/Softmax
[I] uffInput: input,3,224,224
[I] iterations: 40
[I] int8
[I] batch: 128
[I] device: 0
[I] avgRuns: 100
[I] saveEngine: resnet50-infer-5_int8_128_plan
[I] Engine has been successfully saved to resnet50-infer-5_int8_128_plan
[I] Average over 100 runs is 31.0158 ms (host walltime is 31.0655 ms, 99% percentile time is 34.4498).
[I] Average over 100 runs is 31.0847 ms (host walltime is 31.1415 ms, 99% percentile time is 31.5433).
[I] Average over 100 runs is 31.2069 ms (host walltime is 31.2616 ms, 99% percentile time is 31.828).
[I] Average over 100 runs is 31.322 ms (host walltime is 31.3709 ms, 99% percentile time is 32.0401).
[I] Average over 100 runs is 31.4328 ms (host walltime is 31.4824 ms, 99% percentile time is 32.0001).
[I] Average over 100 runs is 31.5245 ms (host walltime is 31.5746 ms, 99% percentile time is 32.0573).
[I] Average over 100 runs is 31.6438 ms (host walltime is 31.6959 ms, 99% percentile time is 32.2762).
[I] Average over 100 runs is 31.5263 ms (host walltime is 31.5756 ms, 99% percentile time is 32.2927).
[I] Average over 100 runs is 31.4481 ms (host walltime is 31.497 ms, 99% percentile time is 31.9877).
[I] Average over 100 runs is 31.4966 ms (host walltime is 31.5488 ms, 99% percentile time is 31.958).
[I] Average over 100 runs is 31.5859 ms (host walltime is 31.635 ms, 99% percentile time is 32.0369).
[I] Average over 100 runs is 31.6306 ms (host walltime is 31.7573 ms, 99% percentile time is 32.1496).
[I] Average over 100 runs is 31.7578 ms (host walltime is 31.8077 ms, 99% percentile time is 32.4488).
[I] Average over 100 runs is 31.8588 ms (host walltime is 31.9081 ms, 99% percentile time is 32.45).
[I] Average over 100 runs is 31.9374 ms (host walltime is 31.9869 ms, 99% percentile time is 32.4997).
[I] Average over 100 runs is 32.0113 ms (host walltime is 32.0639 ms, 99% percentile time is 32.6639).
[I] Average over 100 runs is 32.0946 ms (host walltime is 32.1611 ms, 99% percentile time is 32.5406).
[I] Average over 100 runs is 32.1594 ms (host walltime is 32.2271 ms, 99% percentile time is 32.953).
[I] Average over 100 runs is 32.2697 ms (host walltime is 32.3191 ms, 99% percentile time is 33.0691).
[I] Average over 100 runs is 32.2766 ms (host walltime is 32.3777 ms, 99% percentile time is 33.0165).
[I] Average over 100 runs is 32.4827 ms (host walltime is 32.5354 ms, 99% percentile time is 34.2446).
[I] Average over 100 runs is 32.5598 ms (host walltime is 32.6123 ms, 99% percentile time is 33.3763).
[I] Average over 100 runs is 32.676 ms (host walltime is 32.728 ms, 99% percentile time is 33.5376).
[I] Average over 100 runs is 32.783 ms (host walltime is 32.8358 ms, 99% percentile time is 33.7599).
[I] Average over 100 runs is 32.8409 ms (host walltime is 32.894 ms, 99% percentile time is 33.5811).
[I] Average over 100 runs is 32.9227 ms (host walltime is 32.9749 ms, 99% percentile time is 33.6519).
[I] Average over 100 runs is 32.954 ms (host walltime is 33.0071 ms, 99% percentile time is 33.7234).
[I] Average over 100 runs is 33.063 ms (host walltime is 33.1381 ms, 99% percentile time is 33.8067).
[I] Average over 100 runs is 33.2247 ms (host walltime is 33.2773 ms, 99% percentile time is 34.1873).
[I] Average over 100 runs is 33.2557 ms (host walltime is 33.307 ms, 99% percentile time is 34.2221).
[I] Average over 100 runs is 33.3378 ms (host walltime is 33.4221 ms, 99% percentile time is 34.1745).
[I] Average over 100 runs is 33.4591 ms (host walltime is 33.5097 ms, 99% percentile time is 34.2466).
[I] Average over 100 runs is 33.473 ms (host walltime is 33.5931 ms, 99% percentile time is 34.2543).
[I] Average over 100 runs is 33.5921 ms (host walltime is 33.6431 ms, 99% percentile time is 34.2972).
[I] Average over 100 runs is 33.7104 ms (host walltime is 33.7627 ms, 99% percentile time is 34.605).
[I] Average over 100 runs is 33.7056 ms (host walltime is 33.7629 ms, 99% percentile time is 34.3632).
[I] Average over 100 runs is 33.6665 ms (host walltime is 33.7213 ms, 99% percentile time is 34.3326).
[I] Average over 100 runs is 33.7628 ms (host walltime is 33.818 ms, 99% percentile time is 34.6278).
[I] Average over 100 runs is 33.8401 ms (host walltime is 33.895 ms, 99% percentile time is 34.6502).
[I] Average over 100 runs is 33.716 ms (host walltime is 33.7705 ms, 99% percentile time is 34.7274).

**TensorRT Throughput = (128 / 33.716) × 1000 ≈ 3,796 img/sec**
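For clarity, here is the arithmetic behind that figure as a minimal Python sketch (33.716 ms is the last averaged latency reported by trtexec above):

```python
# Throughput implied by the trtexec run: one batch of 128 images
# completes in ~33.716 ms on average.
batch = 128
avg_latency_ms = 33.716  # last "Average over 100 runs" line above

print(f"{batch / (avg_latency_ms / 1000.0):.0f} img/sec")  # -> ~3796 img/sec
```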

Test 2 with TensorRT Inference Server: Use the plan file generated above and run inference with the C++ client API image-classification performance tool (perf_client.cc)

root@R740:/workspace/src/test# /opt/tensorrtserver/bin/perf_client -m resnet50-infer-5_int8_128_plan_TRT5.1.2 -d -c1 -l200 -p6000 -b128 -u '100.71.242.146:8000'

Source code / logs:

*** Measurement Settings ***
  Batch size: 128
  Measurement window: 6000 msec
  Latency limit: 200 msec
  Concurrency limit: 1 concurrent requests

Request concurrency: 1
  Client:
    Request count: 8
    Throughput: 170 infer/sec
    Avg latency: 805038 usec (standard deviation 6411 usec)
    Avg HTTP time: 805053 usec (send/recv 636544 usec + response wait 168509 usec)
  Server:
    Request count: 9
    Avg request latency: 144555 usec (overhead 5034 usec + queue 44 usec + compute 139477 usec)
[ 0] SUCCESS
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 170 infer/sec, latency 805038 usec

**TRTIS Throughput = 170 infer/sec, latency ≈ 805 msec**
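To compare the two numbers more directly, here is the same arithmetic applied to the perf_client breakdown above (a minimal sketch; all figures are copied from the log, and I am assuming the server-reported compute time covers the full execution of one batch-128 request):

```python
# Per-request numbers reported by perf_client above (batch 128, in usec).
batch = 128
avg_http_usec = 805053        # total client-observed HTTP time
send_recv_usec = 636544       # input/output transfer over HTTP
server_compute_usec = 139477  # server-side compute portion

# Client-observed throughput, dominated by send/recv of the batch data.
print(f"end-to-end:   {batch / (avg_http_usec / 1e6):.0f} infer/sec")       # ~159
# Upper bound if transfer were free: throughput from server compute alone.
print(f"compute-only: {batch / (server_compute_usec / 1e6):.0f} infer/sec") # ~918
```

Even the compute-only figure (~918 infer/sec) is far below the ~3,796 img/sec from trtexec, and on top of that roughly 80% of the client-observed latency is HTTP send/recv, so I am not sure which part explains the gap.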

Here is the model configuration from the config.pbtxt file:

name: "resnet50-infer-5_int8_128_plan_TRT5.1.2"
platform: "tensorrt_plan"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "GPU_0/tower_0/Softmax"
    data_type: TYPE_FP32
    dims: [ 1, 1, 1000 ]
    label_filename: "labels.txt"
  }
]
instance_group [
  {
    count: 1
    gpus: 0
    gpus: 1
    kind: KIND_GPU
  }
]
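
One thing I should note: trtexec above ran on --device=0 only, while this instance_group lists both GPU 0 and GPU 1. In case it is relevant, this is the single-GPU variant I would use to mirror the trtexec setup (shown for reference only; the measurements above were taken with the two-GPU config):

instance_group [
  {
    count: 1
    gpus: 0
    kind: KIND_GPU
  }
]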

I was expecting TRTIS to deliver something close to the 3,796 img/sec measured with trtexec, not 170 infer/sec; that is a huge difference. Could you please assist?