I have been digging more and now I am even more confused by the findings. I was almost sure that the behavior was the result of some bug that looks like a busy wait (polling) in the thread handling each GPU inference instance. Now I am not so sure anymore, but I still cannot understand why it would need 100% of a CPU core to manage one instance of a model loaded onto the GPU.
I have done the following tests:
only one model: resnet50_netdef
input dims: [ 3, 224, 224 ]
output dims: [ 1000 ]
I was changing the count of CPU and/or GPU instances.
Please mind that I have only one physical GPU and a 6-core CPU.
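For completeness, I varied the instance counts via the instance_group section of the model's config.pbtxt, roughly like this (sketch; the counts shown are just an example combination):

```
# config.pbtxt fragment (example: 2 GPU instances + 1 CPU instance)
instance_group [
  { count: 2, kind: KIND_GPU },
  { count: 1, kind: KIND_CPU }
]
```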
I have been testing with:
./perf_client -m resnet50_netdef -i gRPC --shared-memory=cuda --concurrency-range 4:4 -b 10
For server configs with 5 and 6 instances, the command above used --concurrency-range 5:5 and 6:6 respectively.
For the CPU-only inference server I used --shared-memory=shared.
The results are as follows:
- CPU Only
- 1 CPU only instance: 8 infer/sec, 1 CPU used @ 100%
- 6 CPU only instances: 24 infer/sec, 6 CPUs used @ 100%
perf top shows libtorch.so as main CPU consumer.
- GPU only
- 1 GPU instance: 284 infer/sec, 1 CPU used @ 100%
- 2 GPU instances: 306 infer/sec, 2 CPUs used @ 100%
- 3 GPU instances: 314 infer/sec, 3 CPUs used @ 100%
- 4 GPU instances: 320 infer/sec, 4 CPUs used @ 100%
perf top shows libcuda.so.440.33.01 as main CPU consumer.
The jump from 1 to 2 GPU instances is significant. Beyond 2 instances (on my particular GPU) the additional gain is roughly the number of inferences that one extra CPU core can process on its own.
- GPU + CPU mixed
- 1 GPU + 1 CPU: 290 infer/sec, 2 CPUs used @ 100%
- 1 GPU + 2 CPU: 292 infer/sec, 3 CPUs used @ 100%
- 1 GPU + 3 CPU: 292 infer/sec, 4 CPUs used @ 100% (libtorch.so takes over as main consumer)
- 1 GPU + 4 CPU: 292 infer/sec, 5 CPUs used @ 100%
- 1 GPU + 5 CPU: 288 infer/sec, 6 CPUs used @ 100%
- 2 GPU + 1 CPU: 312 infer/sec, 3 CPUs used @ 100%
- 2 GPU + 2 CPU: 316 infer/sec, 4 CPUs used @ 100%
- 2 GPU + 3 CPU: 320 infer/sec, 5 CPUs used @ 100% (libtorch.so takes over)
- 2 GPU + 4 CPU: 320 infer/sec, 6 CPUs used @ 100%
As a next step I tested with cpulimit applied to the trtserver process:
cpulimit -l 50 -p 32703
Without CPU limit the single GPU instance yielded the following results:
Request concurrency: 4
Request count: 144
Throughput: 288 infer/sec
Avg latency: 139105 usec (standard deviation 2174 usec)
p50 latency: 138776 usec
p90 latency: 139160 usec
p95 latency: 139279 usec
p99 latency: 152583 usec
Avg gRPC time: 139254 usec ((un)marshal request/response 6 usec + response wait 139248 usec)
Request count: 172
Avg request latency: 138807 usec (overhead 7 usec + queue 104069 usec + compute 34731 usec)
and nvidia-smi showed 94% GPU utilization.
With cpulimit set to 50%, the very same test yielded:
Request concurrency: 4
Request count: 76
Throughput: 152 infer/sec
Avg latency: 248917 usec (standard deviation 213998 usec)
p50 latency: 137569 usec
p90 latency: 663526 usec
p95 latency: 663847 usec
p99 latency: 664059 usec
Avg gRPC time: 267161 usec ((un)marshal request/response 8 usec + response wait 267153 usec)
Request count: 98
Avg request latency: 266727 usec (overhead 6 usec + queue 199972 usec + compute 66749 usec)
and nvidia-smi showed 54% GPU utilization.
The standard deviation rose ~100x and the throughput dropped by 50%.
And the last test:
Config: 2 GPU instances, 0 CPU instances
without the cpulimit:
306 infer/sec, 2 CPUs used @ 100%, nvidia-smi shows 99% GPU utilization
with cpulimit set to 100% (i.e. one full core!):
168 infer/sec, nvidia-smi shows 49% GPU utilization
So now I have given it a full CPU core (but only one) and it still fails to use the full power of the GPU.
Can anyone explain this behavior? Is it a bug, or is it expected?