Latency increases linearly when concurrent threads increase beyond 2

Hello, for TensorRT Inference Server, my config.pbtxt is:

name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 10
input [
   {
      name: "input_images"
      data_type: TYPE_FP32
      format: FORMAT_NCHW
      dims: [ 3, 1376, 800 ]
   }
]
output [
   {
      name: "feature_fusion/Conv_7/Sigmoid"
      data_type: TYPE_FP32
      dims: [ 344, 200, 1]
   }
]
instance_group [
  {
    kind: KIND_GPU,
    count: 1
  }
]

and when I use

build/perf_client -m my_model -d -c10 -l2000 -p1000 -b1 -v

to test the concurrent performance, I get this result:

Request concurrency: 10
  Pass [1] throughput: 35 infer/sec. Avg latency: 278346 usec (std 37016 usec)
  Pass [2] throughput: 34 infer/sec. Avg latency: 289869 usec (std 11219 usec)
  Pass [3] throughput: 35 infer/sec. Avg latency: 282968 usec (std 8233 usec)
  Client:
    Request count: 35
    Throughput: 35 infer/sec
    Avg latency: 282968 usec (standard deviation 8233 usec)
    Avg HTTP time: 281752 usec (send 8178 usec + response wait 272619 usec + receive 955 usec)
  Server:
    Request count: 46
    Avg request latency: 196643 usec (overhead 973 usec + queue 167606 usec + compute 28064 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 21 infer/sec, latency 47625 usec
Concurrency: 2, 36 infer/sec, latency 55028 usec
Concurrency: 3, 37 infer/sec, latency 82313 usec
Concurrency: 4, 36 infer/sec, latency 110229 usec
Concurrency: 5, 38 infer/sec, latency 135002 usec
Concurrency: 6, 35 infer/sec, latency 170737 usec
Concurrency: 7, 36 infer/sec, latency 198551 usec
Concurrency: 8, 35 infer/sec, latency 230402 usec
Concurrency: 9, 36 infer/sec, latency 251738 usec
Concurrency: 10, 35 infer/sec, latency 282968 usec

Obviously, once concurrency goes above 2, latency increases almost linearly while throughput stays flat at around 35 infer/sec, and the server stats show that most of the request latency is queue time (167606 usec of the 196643 usec total). Is this normal, and how can I decrease this latency?

Linux distro and version:

LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.4.1708 (Core)
Release:	7.4.1708
Codename:	Core

Other environment details:

nvcr.io/nvidia/tensorrtserver   18.11-py3
GPU type: Tesla V100
nvidia driver version: NVIDIA-SMI 410.48
CUDA version: 9.0
CUDNN version: 7.3.0

We are reviewing this and will keep you updated.

@NVES I just tested the resnet50_netdef model obtained from dl-inference-server/examples/fetch_models.sh. When I run build/perf_client -m resnet50_netdef -d -c10 -l2000 -p1000 -b1 -v, latency also appears to increase linearly once concurrency goes above 2:

Request concurrency: 10
  Pass [1] throughput: 196 infer/sec. Avg latency: 50515 usec (std 5364 usec)
  Pass [2] throughput: 197 infer/sec. Avg latency: 50684 usec (std 4757 usec)
  Pass [3] throughput: 197 infer/sec. Avg latency: 50477 usec (std 5569 usec)
  Client:
    Request count: 197
    Throughput: 197 infer/sec
    Avg latency: 50477 usec (standard deviation 5569 usec)
    Avg HTTP time: 50491 usec (send 349 usec + response wait 50122 usec + receive 20 usec)
  Server:
    Request count: 241
    Avg request latency: 35365 usec (overhead 56 usec + queue 15104 usec + compute 20205 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 124 infer/sec, latency 8102 usec
Concurrency: 2, 172 infer/sec, latency 11635 usec
Concurrency: 3, 190 infer/sec, latency 15771 usec
Concurrency: 4, 180 infer/sec, latency 22147 usec
Concurrency: 5, 197 infer/sec, latency 25262 usec
Concurrency: 6, 196 infer/sec, latency 30450 usec
Concurrency: 7, 197 infer/sec, latency 35268 usec
Concurrency: 8, 199 infer/sec, latency 40301 usec
Concurrency: 9, 199 infer/sec, latency 45343 usec
Concurrency: 10, 197 infer/sec, latency 50477 usec

@86108429 What kind of GPU are you using? A V100?

@J8oe Yes, a V100. Have you solved this problem?

@NVES any feedback?

instance_group [
  {
    kind: KIND_GPU,
    count: 1
  }
]

Set the instance_group count to a value greater than 1 and give that a try. With count: 1 there is only one execution instance, so requests are processed one after another and the queue time ends up dominating your latency.
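For example, a minimal sketch of that change against the config.pbtxt you posted (count: 2 is just an illustrative value; the rest of the config stays the same):

instance_group [
  {
    kind: KIND_GPU,
    count: 2
  }
]

Each instance loads its own copy of the model, so two requests can execute at the same time instead of waiting in the queue; you can keep raising count until throughput stops improving or GPU memory becomes the limit.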

BTW, which version of TensorRT Inference Server are you using?