Large latency when using `tritonclient.http.aio.infer`

Description

I’m running an inference server with the nvcr.io/nvidia/tritonserver:24.03-py3 container. I converted the 003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth model to a TensorRT engine with a static input shape input_0: 1x3x512x512. Its original repo is JingyunLiang/SwinIR: SwinIR: Image Restoration Using Swin Transformer (official repository) (github.com).
I use the tritonclient package to send inference requests. When I use tritonclient.http.aio and make an async request with response = await client.infer(), the Triton server takes much longer to release the output buffer than with the sync tritonclient.http.infer(). To be more precise, the server holds the request’s output buffer for a long time before setting its state from EXECUTING to RELEASED, and only then does the client get the response. Below are the verbose logs from the Triton server:
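For reference, this is roughly what the two call paths look like (a minimal sketch, not my exact script; the server URL is the assumed default Triton HTTP port, and the tritonclient imports are done lazily inside the functions so the snippet loads even without tritonclient installed):

```python
import time

import numpy as np

MODEL = "SwinIR_realSR_s64w8_4x_512x512"
URL = "localhost:8000"  # assumed: default Triton HTTP port
INPUT_SHAPE = [1, 3, 512, 512]


def tensor_nbytes(shape, dtype=np.float32):
    # Dense tensor size in bytes; matches the pinned-memory sizes in the logs below
    # (input 1x3x512x512 FP32 -> 3145728, output 1x3x2048x2048 FP32 -> 50331648).
    return int(np.prod(shape)) * np.dtype(dtype).itemsize


def sync_infer(image: np.ndarray) -> np.ndarray:
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url=URL)
    inp = httpclient.InferInput("input_0", INPUT_SHAPE, "FP32")
    inp.set_data_from_numpy(image)
    t0 = time.perf_counter()
    resp = client.infer(MODEL, inputs=[inp])
    print(f"sync infer: {time.perf_counter() - t0:.2f}s")
    return resp.as_numpy("output_0")


async def async_infer(image: np.ndarray) -> np.ndarray:
    import tritonclient.http.aio as aioclient

    client = aioclient.InferenceServerClient(url=URL)
    inp = aioclient.InferInput("input_0", INPUT_SHAPE, "FP32")
    inp.set_data_from_numpy(image)
    t0 = time.perf_counter()
    resp = await client.infer(MODEL, inputs=[inp])  # noticeably slower HTTP release
    print(f"async infer: {time.perf_counter() - t0:.2f}s")
    await client.close()
    return resp.as_numpy("output_0")
```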
=========sync infer log: HTTP release takes ~1 second============

I0612 01:24:07.399564 767 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:24:07.399802 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:24:07.399826 767 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f9fd4002ad0] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:24:07.399865 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:24:07.400051 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:24:07.400112 767 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:24:07.400129 767 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400143 767 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400270 767 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:07.400324 767 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7fa704000090
I0612 01:24:07.400921 767 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:08.091022 767 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:24:08.091061 767 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:24:08.091093 767 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f9f8bfff040
I0612 01:24:08.091108 767 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7fa7043000a0
I0612 01:24:09.044377 767 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f9f8bfff040
I0612 01:24:09.044443 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:24:09.044455 767 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:24:09.044461 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa7043000a0
I0612 01:24:09.044474 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa704000090

============async infer log: HTTP release takes ~4 seconds==================

I0612 01:43:29.902539 814 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:43:29.902675 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:43:29.902687 814 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f400c003200] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:43:29.902714 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:43:29.902825 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:43:29.902941 814 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:43:29.902997 814 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903022 814 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903180 814 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:29.903245 814 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7f4736000090
I0612 01:43:29.903906 814 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:30.594829 814 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:43:30.594875 814 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:43:30.594912 814 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f3fd3fff040
I0612 01:43:30.594927 814 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7f47363000a0
I0612 01:43:34.582690 814 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f3fd3fff040
I0612 01:43:34.582782 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:43:34.582798 814 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:43:34.582807 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f47363000a0
I0612 01:43:34.582828 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f4736000090
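Computing the output-buffer hold time (pinned allocation to HTTP release) from the timestamps in the two logs above gives ~0.95 s for the sync call versus ~3.99 s for the async call:

```python
from datetime import datetime


def hold_seconds(alloc_ts: str, release_ts: str) -> float:
    # Gap between pinned-memory allocation and HTTP release, from the log timestamps.
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(release_ts, fmt)
            - datetime.strptime(alloc_ts, fmt)).total_seconds()


sync_hold = hold_seconds("01:24:08.091108", "01:24:09.044377")   # ~0.95 s
async_hold = hold_seconds("01:43:30.594927", "01:43:34.582690")  # ~3.99 s
```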

Environment

TensorRT Version:
GPU Type: NVIDIA A40
Nvidia Driver Version: 530.30.02
CUDA Version: 12.4.99
CUDNN Version: 9.0.0
Operating System + Version: Ubuntu 22.04.4 LTS (x86_64)
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, nvcr.io/nvidia/tritonserver:24.03-py3

Hi @RicardoLu,
Apologies for the delay.
Please raise this concern on the Triton GitHub issues page.