Description
Hi all, I am trying to gauge Triton Inference Server performance on my hardware with the models I have; the goal is to serve as many requests as possible across all of the models.
The Triton Inference Server is deployed as a Docker container with no changes and is pointed at a model repository containing 4 models.
I attempted to use perf_analyzer for all 4 models by deploying 4 separate Docker containers from the tritonserver:23.03-py3-sdk image and executing the command below in each at nearly the same time:
perf_analyzer -m MODELNAME -b 1 --shape input_1:1,1 --shape input_2:1,12345 --request-rate-range 10 -u triton-host:8001 -i gRPC --measurement-interval 20000
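For reference, a rough sketch of how the four clients were launched in parallel (model_1..model_4 and triton-host are placeholders for my actual model names and server host):

# one SDK container per model, all started at roughly the same time
for MODEL in model_1 model_2 model_3 model_4; do
  docker run --rm nvcr.io/nvidia/tritonserver:23.03-py3-sdk \
    perf_analyzer -m "$MODEL" -b 1 \
      --shape input_1:1,1 --shape input_2:1,12345 \
      --request-rate-range 10 \
      -u triton-host:8001 -i gRPC --measurement-interval 20000 &
done
wait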
I found that none of the requests completed, and after a while I would get the message below in each of the containers where I ran perf_analyzer:
No valid requests recorded within the time interval. Please use a larger time window
I have tried larger measurement windows as well (see the variant sketched below), with no success.
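For example, one of the larger-window runs looked roughly like this (the 60000 ms interval is illustrative; I tried several sizes):

perf_analyzer -m MODELNAME -b 1 --shape input_1:1,1 --shape input_2:1,12345 \
  --request-rate-range 10 -u triton-host:8001 -i gRPC --measurement-interval 60000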
I also tried --concurrency-range 4 instead of --request-rate-range 10 (roughly the command shown below), along with other variations of these parameters, but this time the Triton server crashed with Segmentation fault (Signal (11) received). This sometimes happened even with a single container running perf_analyzer.
Hence, I have the following questions:
- Is my approach itself at fault here, i.e., running multiple perf_analyzer instances at once?
- If so, how should I go about measuring Triton's performance when different models are requested for inference at nearly the same time?
- Any other solutions, suggestions, or tips for understanding what might have gone wrong here?
Environment
TensorRT Version: 8.5.3
GPU Type: T4
Nvidia Driver Version: 470.182.03
CUDA Version: 12.1
CUDNN Version: 8.8.01
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): Python 3.8
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tritonserver:23.03-py3