TensorRT Inference Server with GPU-only instances: high CPU usage


I have installed the TensorRT Inference Server (using the prebuilt Docker image for Ubuntu 18.04, version 20.01-py3).
I am using perf_client (compiled directly on my host, no Docker).

Everything seems to be working fine (I am testing on a laptop with a GeForce GTX 1660 Ti and an i7-9750H with 6 physical/12 logical cores). I have set up trtserver to serve only one model (resnet50_netdef). All works well.
I wanted to test the potential speed gain (in terms of infer/sec) achievable by parallel inference, so I added the following to the model's config file (config.pbtxt):

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
I first tried with 2 and then 4 instances. Everything works and the speed gain is significant; however, I spotted some strange behavior:

The trtserver uses (apart from the GPU) exactly as many CPU cores (at 100%) as the instance_group.count I set in the model's config.pbtxt.

The CPU usage was measured with a simple top command.

Is this expected behavior, or am I missing something?
I have searched the web and this forum and found no information on the issue.

Hi again,

I have been digging deeper and am now more and more confused by the findings. I was almost sure the behavior was the result of a bug that looks like a busy wait (polling) in the thread handling each GPU inference instance. Now I am not so sure anymore, but I still cannot understand why it would need 100% of a CPU core to manage one instance of a model loaded onto the GPU.

I have done the following tests:

only one model: resnet50_netdef

name: "resnet50_netdef"
platform: "caffe2_netdef"
max_batch_size: 10
input [
  {
    name: "gpu_0/data"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "gpu_0/softmax"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_labels.txt"
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]

I varied the count of CPU and/or GPU instances.
Please note that I have only one physical GPU and a 6-core CPU.

I have been testing with:

./perf_client -m resnet50_netdef -i gRPC --shared-memory=cuda --concurrency-range 4:4 -b 10

For server configs with 5 and 6 instances, the command above used --concurrency-range 5:5 and 6:6 respectively.

For the CPU-only inference server I used --shared-memory=shared.
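For reference, the family of perf_client invocations used below can be generated with a small helper (a sketch of my own convention in this thread, not part of perf_client; the flag values are copied verbatim from the commands above):

```python
# Sketch: build the perf_client command line for a given instance count.
# Flag values mirror the commands used in this thread; matching concurrency
# to the instance count (above 4 instances) is the convention from my tests.

def perf_client_cmd(instances: int, cpu_only: bool = False, batch: int = 10) -> str:
    # Shared-memory values as reported in this thread.
    shm = "shared" if cpu_only else "cuda"
    # Concurrency 4:4 up to 4 instances, then n:n for 5 and 6 instances.
    conc = max(4, instances)
    return (
        f"./perf_client -m resnet50_netdef -i gRPC "
        f"--shared-memory={shm} --concurrency-range {conc}:{conc} -b {batch}"
    )

if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(perf_client_cmd(n))
```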

The results are as follows:

  1. CPU only
    • 1 CPU instance: 8 infer/sec, 1 CPU core at 100%
    • 6 CPU instances: 24 infer/sec, 6 CPU cores at 100%
    • perf top shows libtorch.so as the main CPU consumer.
  2. GPU only
    • 1 GPU instance: 284 infer/sec, 1 CPU core at 100%
    • 2 GPU instances: 306 infer/sec, 2 CPU cores at 100%
    • 3 GPU instances: 314 infer/sec, 3 CPU cores at 100%
    • 4 GPU instances: 320 infer/sec, 4 CPU cores at 100%
    • perf top shows libcuda.so.440.33.01 as the main CPU consumer. The gain from 1 to 2 GPU instances is significant; above 2 instances (on my particular GPU) the gain is similar to the number of inferences one additional CPU core can process.
  3. Mixed
    • 1 GPU + 1 CPU: 290 infer/sec, 2 CPU cores at 100%
    • 1 GPU + 2 CPU: 292 infer/sec, 3 CPU cores at 100%
    • 1 GPU + 3 CPU: 292 infer/sec, 4 CPU cores at 100% (libtorch.so takes over)
    • 1 GPU + 4 CPU: 292 infer/sec, 5 CPU cores at 100%
    • 1 GPU + 5 CPU: 288 infer/sec, 6 CPU cores at 100%
    • 2 GPU + 1 CPU: 312 infer/sec, 3 CPU cores at 100%
    • 2 GPU + 2 CPU: 316 infer/sec, 4 CPU cores at 100%
    • 2 GPU + 3 CPU: 320 infer/sec, 5 CPU cores at 100% (libtorch.so takes over)
    • 2 GPU + 4 CPU: 320 infer/sec, 6 CPU cores at 100%

As a next step I tested with cpulimit applied to the trtserver process:

cpulimit -l 50 -p 32703

Without the CPU limit, the single GPU instance yielded the following results:

    Request concurrency: 4
        Request count: 144
        Throughput: 288 infer/sec
        Avg latency: 139105 usec (standard deviation 2174 usec)
        p50 latency: 138776 usec
        p90 latency: 139160 usec
        p95 latency: 139279 usec
        p99 latency: 152583 usec
        Avg gRPC time: 139254 usec ((un)marshal request/response 6 usec + response wait 139248 usec)
        Request count: 172
        Avg request latency: 138807 usec (overhead 7 usec + queue 104069 usec + compute 34731 usec)

and nvidia-smi showed 94% GPU utilization.

With cpulimit set to 50%, the very same test yielded:

    Request concurrency: 4
        Request count: 76
        Throughput: 152 infer/sec
        Avg latency: 248917 usec (standard deviation 213998 usec)
        p50 latency: 137569 usec
        p90 latency: 663526 usec
        p95 latency: 663847 usec
        p99 latency: 664059 usec
        Avg gRPC time: 267161 usec ((un)marshal request/response 8 usec + response wait 267153 usec)
        Request count: 98
        Avg request latency: 266727 usec (overhead 6 usec + queue 199972 usec + compute 66749 usec)

and nvidia-smi showed 54% GPU utilization.

The standard deviation rose about 100-fold and the throughput dropped by ~50%.

And the last test:
Config: 2 GPU instances, 0 CPU instances

Without cpulimit:
306 infer/sec, 2 CPU cores at 100%, nvidia-smi shows 99% GPU utilization

With cpulimit set to 100% (!!!):
Throughput: 168 infer/sec, nvidia-smi shows 49% GPU utilization

So now I have given it a full CPU core (but only one) and it still fails to use the full power of the GPU.

Can anyone explain this behavior? Is it a bug, or is it expected?

I'm not entirely sure what you find unusual, so perhaps you can summarize. Even when a model is configured to run on the GPU (using instance_group, though running on the GPU is also the default when one is available), the inference server still must use the CPU, and depending on the model and the throughput the CPU usage can be significant.

If you use the following:

instance_group [
  {
    count: 1
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]

You are asking the inference server to run 2 copies of the model, one on the GPU and one on the CPU. The copy that runs on the CPU doesn't necessarily use a single CPU core; it depends on the model framework, but in general a model running on the CPU may use many cores.

Good morning David,

thank you for looking at this. I'll try to summarize.
Let's drop the cases with model instances running on the CPU and focus only on the GPU ones.

I understand that a model running on the GPU still uses the CPU, and that the CPU usage may be significant. But it does seem strange that every copy of the model run on the GPU needs EXACTLY one CPU core at EXACTLY 100%, and that trimming the process's CPU allowance to ~50% cuts the inference server's throughput by ~50% and GPU utilization by ~50% (nvidia-smi).

This could be explained simply by my CPU performance being the bottleneck. But if that were the case, why (with a single model instance on the GPU) does the inference server not attempt to use more CPU cores?

This exactness (each model instance on the GPU uses exactly one CPU core at 100% utilization) looks like a busy loop (polling) somewhere in the code handling GPU inference. Something like a busy wait for results from the GPU?
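The busy-wait pattern I suspect is easy to illustrate outside the server: a spinning wait consumes a full core of CPU time even though no useful work is done, while a blocking wait consumes almost none. A minimal self-contained sketch of the general pattern (not trtserver code):

```python
import threading
import time

def spin_wait(deadline: float) -> None:
    # Busy wait (polling): the thread keeps checking the clock,
    # burning a full CPU core the whole time.
    while time.monotonic() < deadline:
        pass

def block_wait(event: threading.Event, timeout: float) -> None:
    # Blocking wait: the thread sleeps in the kernel until woken
    # or timed out, using almost no CPU.
    event.wait(timeout)

def cpu_seconds(fn, *args) -> float:
    # CPU time (not wall time) consumed while fn runs.
    start = time.process_time()
    fn(*args)
    return time.process_time() - start

busy_cpu = cpu_seconds(spin_wait, time.monotonic() + 0.2)
idle_cpu = cpu_seconds(block_wait, threading.Event(), 0.2)
print(f"busy wait: {busy_cpu:.3f}s CPU, blocking wait: {idle_cpu:.3f}s CPU")
```

A thread stuck in the first pattern shows up in top exactly as described above: one CPU core pinned at 100% per spinning thread, with perf top attributing the time to whatever library contains the polling loop.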

I think you are likely right that your use case is CPU bound. As to why the server doesn't "use more CPU cores" to accelerate a single model instance: that is not really possible in general. In the server there is a single CPU thread associated with each model instance (TensorRT models are handled somewhat differently, but the same analysis mostly applies). That CPU thread gets the inference request and is then responsible for scheduling all the GPU memory copies and kernel executions for it. That scheduling work cannot be spread across multiple CPU threads. If you have 2 instances of a model you will have 2 CPU threads, each scheduling work onto the same GPU. It is somewhat surprising that going to 4 instances still has all 4 CPU threads at 100%; at some point you should become GPU bound.
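The one-scheduling-thread-per-instance model described above can be sketched in a few lines (a deliberate simplification, not the actual server code: each "instance" is a thread that pulls requests from the shared queue and handles its scheduling serially):

```python
import queue
import threading

def run_instances(requests, instance_count: int):
    # One scheduling thread per model instance, all feeding the same GPU.
    work: "queue.Queue" = queue.Queue()
    for req in requests:
        work.put(req)

    results = []
    lock = threading.Lock()

    def instance_thread():
        while True:
            try:
                req = work.get_nowait()
            except queue.Empty:
                return  # no more requests for this instance to schedule
            # Here the real server would issue the memory copies and kernel
            # launches for this request; that scheduling is single-threaded
            # per instance, which is why each instance maps to one CPU thread.
            with lock:
                results.append(req)

    threads = [threading.Thread(target=instance_thread) for _ in range(instance_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(sorted(run_instances(range(10), instance_count=2)))  # all 10 requests handled
```

With count: 2 in the instance_group there are two such threads, matching the "exactly N CPU cores for N GPU instances" observation.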

Are you familiar with the Nsight Systems profiling tool? It is fairly easy to use and can give you a timeline of CPU and GPU activity, where you can see what those CPU threads are doing.