TensorRT Inference Server with GPU-only instances: high CPU usage


I have installed the TensorRT Inference Server (using the prebuilt Docker image for Ubuntu 18.04, version 20.01-py3).
I am using perf_client (compiled directly on my host, no Docker).

Everything seems to be working fine (I am testing on a laptop with a GeForce GTX 1660 Ti and an i7-9750H with 6 physical/12 logical cores). I have set up trtserver to serve only one model (resnet50_netdef). All works well.
I wanted to test the potential speed gain (in terms of infer/sec) achievable by parallel inference, so I added the following to the model's config file (config.pbtxt):

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
I first tried with 2 and then 4 instances. Everything works and the speed gain is significant; however, I spotted some strange behavior:

The trtserver uses (apart from the GPU) exactly as many CPU cores (at 100%) as the instance_group.count I set in the model's config.pbtxt.

The CPU usage was measured with a simple top command.

Is this expected behavior, or am I missing something?
I have searched the web and this forum and found no information on the issue.

Hi again,

I have been digging deeper and am now more and more confused by the findings. I was almost sure the behavior was the result of a bug that looks like a busy wait (polling) in the thread handling each GPU inference instance. Now I am not so sure anymore, but I still cannot understand why it would need 100% of a CPU core to manage one instance of a model loaded onto the GPU.

I have done the following tests:

only one model: resnet50_netdef

name: "resnet50_netdef"
platform: "caffe2_netdef"
max_batch_size: 10
input [
  {
    name: "gpu_0/data"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "gpu_0/softmax"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_labels.txt"
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]

I varied the count of CPU and/or GPU instances.
Please note that I have only one physical GPU and a 6-core CPU.

I have been testing with:

./perf_client -m resnet50_netdef -i gRPC --shared-memory=cuda --concurrency-range 4:4 -b 10

For server configs with 5 and 6 instances, the command above used --concurrency-range 5:5 and 6:6 respectively.

For the CPU-only inference server I used --shared-memory=shared.
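For reference, the family of perf_client invocations used below can be generated with a small helper (a sketch of my own convention in this thread, not part of perf_client; the flag values are copied verbatim from the commands above):

```python
# Sketch: build the perf_client command line for a given instance count.
# Flag values mirror the commands used in this thread; matching concurrency
# to the instance count (above 4 instances) is the convention from my tests.

def perf_client_cmd(instances: int, cpu_only: bool = False, batch: int = 10) -> str:
    # Shared-memory values as reported in this thread.
    shm = "shared" if cpu_only else "cuda"
    # Concurrency 4:4 up to 4 instances, then n:n for 5 and 6 instances.
    conc = max(4, instances)
    return (
        f"./perf_client -m resnet50_netdef -i gRPC "
        f"--shared-memory={shm} --concurrency-range {conc}:{conc} -b {batch}"
    )

if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(perf_client_cmd(n))
```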

The results are as follows:

  1. CPU only
    • 1 CPU instance: 8 infer/sec, 1 CPU core at 100%
    • 6 CPU instances: 24 infer/sec, 6 CPU cores at 100%
    • perf top shows libtorch.so as the main CPU consumer.
  2. GPU only
    • 1 GPU instance: 284 infer/sec, 1 CPU core at 100%
    • 2 GPU instances: 306 infer/sec, 2 CPU cores at 100%
    • 3 GPU instances: 314 infer/sec, 3 CPU cores at 100%
    • 4 GPU instances: 320 infer/sec, 4 CPU cores at 100%
    • perf top shows libcuda.so.440.33.01 as the main CPU consumer. The gain from 1 to 2 GPU instances is significant; above 2 instances (on my particular GPU) the gain is similar to the number of inferences one additional CPU core can process.
  3. Mixed
    • 1 GPU + 1 CPU: 290 infer/sec, 2 CPU cores at 100%
    • 1 GPU + 2 CPU: 292 infer/sec, 3 CPU cores at 100%
    • 1 GPU + 3 CPU: 292 infer/sec, 4 CPU cores at 100% (libtorch.so takes over)
    • 1 GPU + 4 CPU: 292 infer/sec, 5 CPU cores at 100%
    • 1 GPU + 5 CPU: 288 infer/sec, 6 CPU cores at 100%
    • 2 GPU + 1 CPU: 312 infer/sec, 3 CPU cores at 100%
    • 2 GPU + 2 CPU: 316 infer/sec, 4 CPU cores at 100%
    • 2 GPU + 3 CPU: 320 infer/sec, 5 CPU cores at 100% (libtorch.so takes over)
    • 2 GPU + 4 CPU: 320 infer/sec, 6 CPU cores at 100%

As a next step I tested with cpulimit applied to the trtserver process:

cpulimit -l 50 -p 32703

Without the CPU limit, the single GPU instance yielded the following results:

    Request concurrency: 4
        Request count: 144
        Throughput: 288 infer/sec
        Avg latency: 139105 usec (standard deviation 2174 usec)
        p50 latency: 138776 usec
        p90 latency: 139160 usec
        p95 latency: 139279 usec
        p99 latency: 152583 usec
        Avg gRPC time: 139254 usec ((un)marshal request/response 6 usec + response wait 139248 usec)
        Request count: 172
        Avg request latency: 138807 usec (overhead 7 usec + queue 104069 usec + compute 34731 usec)

and nvidia-smi showed 94% GPU utilization.

With cpulimit set to 50%, the very same test yielded:

    Request concurrency: 4
        Request count: 76
        Throughput: 152 infer/sec
        Avg latency: 248917 usec (standard deviation 213998 usec)
        p50 latency: 137569 usec
        p90 latency: 663526 usec
        p95 latency: 663847 usec
        p99 latency: 664059 usec
        Avg gRPC time: 267161 usec ((un)marshal request/response 8 usec + response wait 267153 usec)
        Request count: 98
        Avg request latency: 266727 usec (overhead 6 usec + queue 199972 usec + compute 66749 usec)

and nvidia-smi showed 54% GPU utilization.

The standard deviation rose about 100-fold and the throughput dropped by ~50%.

And the last test:
Config: 2 GPU instances, 0 CPU instances

Without cpulimit:
306 infer/sec, 2 CPU cores at 100%, nvidia-smi shows 99% GPU utilization

With cpulimit set to 100% (!!!):
Throughput: 168 infer/sec, nvidia-smi shows 49% GPU utilization

So now I have given it a full CPU core (but only one) and it still fails to use the full power of the GPU.

Can anyone explain this behavior? Is it a bug, or is it expected?

I'm not entirely sure what you find unusual, so perhaps you can summarize. Even when a model is configured to run on the GPU (using instance_group, though running on the GPU is also the default when one is available), the inference server still must use the CPU, and depending on the model and the throughput the CPU usage can be significant.

If you use the following:

instance_group [
  {
    count: 1
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]

You are asking the inference server to run 2 copies of the model, one on the GPU and one on the CPU. The copy that runs on the CPU doesn't necessarily use a single CPU core; it depends on the model framework, but in general a model running on the CPU may use many cores.

Good morning David,

thank you for looking at this. I'll try to summarize.
Let's drop the cases with model instances running on the CPU and focus only on the GPU ones.

I understand that a model running on the GPU still uses the CPU, and that the CPU usage may be significant. But it does seem strange that every copy of the model run on the GPU needs EXACTLY one CPU core at EXACTLY 100%, and that trimming the process's CPU allowance to ~50% cuts the inference server's throughput by ~50% and GPU utilization by ~50% (nvidia-smi).

This could be explained simply by my CPU performance being the bottleneck. But if that were the case, why (with a single model instance on the GPU) does the inference server not attempt to use more CPU cores?

This exactness (each model instance on the GPU uses exactly one CPU core at 100% utilization) looks like a busy loop (polling) somewhere in the code handling GPU inference. Something like a busy wait for results from the GPU?
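The busy-wait pattern I suspect is easy to illustrate outside the server: a spinning wait consumes a full core of CPU time even though no useful work is done, while a blocking wait consumes almost none. A minimal self-contained sketch of the general pattern (not trtserver code):

```python
import threading
import time

def spin_wait(deadline: float) -> None:
    # Busy wait (polling): the thread keeps checking the clock,
    # burning a full CPU core the whole time.
    while time.monotonic() < deadline:
        pass

def block_wait(event: threading.Event, timeout: float) -> None:
    # Blocking wait: the thread sleeps in the kernel until woken
    # or timed out, using almost no CPU.
    event.wait(timeout)

def cpu_seconds(fn, *args) -> float:
    # CPU time (not wall time) consumed while fn runs.
    start = time.process_time()
    fn(*args)
    return time.process_time() - start

busy_cpu = cpu_seconds(spin_wait, time.monotonic() + 0.2)
idle_cpu = cpu_seconds(block_wait, threading.Event(), 0.2)
print(f"busy wait: {busy_cpu:.3f}s CPU, blocking wait: {idle_cpu:.3f}s CPU")
```

A thread stuck in the first pattern shows up in top exactly as described above: one CPU core pinned at 100% per spinning thread, with perf top attributing the time to whatever library contains the polling loop.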

I think you are likely right that your use case is CPU bound. As to why the server doesn't "use more CPU cores" to accelerate a single model instance: that is not really possible in general. In the server there is a single CPU thread associated with each model instance (TensorRT models are handled somewhat differently, but the same analysis mostly applies). That CPU thread gets the inference request and is then responsible for scheduling all the GPU memory copies and kernel executions for it. That scheduling work cannot be spread across multiple CPU threads. If you have 2 instances of a model you will have 2 CPU threads, each scheduling work onto the same GPU. It is somewhat surprising that going to 4 instances still has all 4 CPU threads at 100%; at some point you should become GPU bound.
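The one-scheduling-thread-per-instance model described above can be sketched in a few lines (a deliberate simplification, not the actual server code: each "instance" is a thread that pulls requests from the shared queue and handles its scheduling serially):

```python
import queue
import threading

def run_instances(requests, instance_count: int):
    # One scheduling thread per model instance, all feeding the same GPU.
    work: "queue.Queue" = queue.Queue()
    for req in requests:
        work.put(req)

    results = []
    lock = threading.Lock()

    def instance_thread():
        while True:
            try:
                req = work.get_nowait()
            except queue.Empty:
                return  # no more requests for this instance to schedule
            # Here the real server would issue the memory copies and kernel
            # launches for this request; that scheduling is single-threaded
            # per instance, which is why each instance maps to one CPU thread.
            with lock:
                results.append(req)

    threads = [threading.Thread(target=instance_thread) for _ in range(instance_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(sorted(run_instances(range(10), instance_count=2)))  # all 10 requests handled
```

With count: 2 in the instance_group there are two such threads, matching the "exactly N CPU cores for N GPU instances" observation.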

Are you familiar with the Nsight Systems profiling tool? It is fairly easy to use and can give you a timeline of CPU and GPU activity, where you can see what those CPU threads are doing.