Large latency when using `tritonclient.http.aio.infer`

Description

I’m running an inference server with the nvcr.io/nvidia/tritonserver:24.03-py3 container. I converted the 003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth model to a TensorRT engine with a static input shape input_0: 1x3x512x512. Its original repo is JingyunLiang/SwinIR: SwinIR: Image Restoration Using Swin Transformer (official repository) (github.com).
I use the tritonclient package to send inference requests. When I use tritonclient.http.aio and make an async request with response = await client.infer(), the Triton server takes much longer to release the output buffer than with the sync tritonclient.http.infer(). To be more precise, the server holds the request’s output buffer for a long time before setting its state from EXECUTING to RELEASED, and only then does the client get the response. Below are the verbose logs from the Triton server:
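For reference, this is roughly what the two call paths look like (a minimal sketch, not my exact script; the server URL is the assumed default Triton HTTP port, and the tritonclient imports are done lazily inside the functions so the snippet loads even without tritonclient installed):

```python
import time

import numpy as np

MODEL = "SwinIR_realSR_s64w8_4x_512x512"
URL = "localhost:8000"  # assumed: default Triton HTTP port
INPUT_SHAPE = [1, 3, 512, 512]


def tensor_nbytes(shape, dtype=np.float32):
    # Dense tensor size in bytes; matches the pinned-memory sizes in the logs below
    # (input 1x3x512x512 FP32 -> 3145728, output 1x3x2048x2048 FP32 -> 50331648).
    return int(np.prod(shape)) * np.dtype(dtype).itemsize


def sync_infer(image: np.ndarray) -> np.ndarray:
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url=URL)
    inp = httpclient.InferInput("input_0", INPUT_SHAPE, "FP32")
    inp.set_data_from_numpy(image)
    t0 = time.perf_counter()
    resp = client.infer(MODEL, inputs=[inp])
    print(f"sync infer: {time.perf_counter() - t0:.2f}s")
    return resp.as_numpy("output_0")


async def async_infer(image: np.ndarray) -> np.ndarray:
    import tritonclient.http.aio as aioclient

    client = aioclient.InferenceServerClient(url=URL)
    inp = aioclient.InferInput("input_0", INPUT_SHAPE, "FP32")
    inp.set_data_from_numpy(image)
    t0 = time.perf_counter()
    resp = await client.infer(MODEL, inputs=[inp])  # noticeably slower HTTP release
    print(f"async infer: {time.perf_counter() - t0:.2f}s")
    await client.close()
    return resp.as_numpy("output_0")
```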
=========sync infer log: HTTP release takes ~1 second============

I0612 01:24:07.399564 767 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:24:07.399802 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:24:07.399826 767 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f9fd4002ad0] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:24:07.399865 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:24:07.400051 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:24:07.400112 767 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:24:07.400129 767 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400143 767 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400270 767 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:07.400324 767 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7fa704000090
I0612 01:24:07.400921 767 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:08.091022 767 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:24:08.091061 767 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:24:08.091093 767 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f9f8bfff040
I0612 01:24:08.091108 767 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7fa7043000a0
I0612 01:24:09.044377 767 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f9f8bfff040
I0612 01:24:09.044443 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:24:09.044455 767 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:24:09.044461 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa7043000a0
I0612 01:24:09.044474 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa704000090

============async infer log: HTTP release takes ~4 seconds==================

I0612 01:43:29.902539 814 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:43:29.902675 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:43:29.902687 814 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f400c003200] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:43:29.902714 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:43:29.902825 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:43:29.902941 814 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:43:29.902997 814 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903022 814 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903180 814 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:29.903245 814 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7f4736000090
I0612 01:43:29.903906 814 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:30.594829 814 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:43:30.594875 814 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:43:30.594912 814 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f3fd3fff040
I0612 01:43:30.594927 814 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7f47363000a0
I0612 01:43:34.582690 814 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f3fd3fff040
I0612 01:43:34.582782 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:43:34.582798 814 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:43:34.582807 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f47363000a0
I0612 01:43:34.582828 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f4736000090
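Computing the output-buffer hold time (pinned allocation to HTTP release) from the timestamps in the two logs above gives ~0.95 s for the sync call versus ~3.99 s for the async call:

```python
from datetime import datetime


def hold_seconds(alloc_ts: str, release_ts: str) -> float:
    # Gap between pinned-memory allocation and HTTP release, from the log timestamps.
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(release_ts, fmt)
            - datetime.strptime(alloc_ts, fmt)).total_seconds()


sync_hold = hold_seconds("01:24:08.091108", "01:24:09.044377")   # ~0.95 s
async_hold = hold_seconds("01:43:30.594927", "01:43:34.582690")  # ~3.99 s
```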

Environment

TensorRT Version:
GPU Type: NVIDIA A40
Nvidia Driver Version: 530.30.02
CUDA Version: 12.4.99
CUDNN Version: 9.0.0
Operating System + Version: Ubuntu 22.04.4 LTS (x86_64)
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, nvcr.io/nvidia/tritonserver:24.03-py3

Hi @RicardoLu,
Apologies for the delay.
Please raise this concern on the Triton GitHub issues page.