Description
Model: RoFormer
Backend: TensorRT with dynamic batch size and dynamic input length
Problem:
- Throughput measured with JMeter (125 infer/sec) is far lower than with perf_analyzer (689 infer/sec) for batch size 1.
- I tried varying several parameters (model instances, max_batch_size, preferred_batch_size, max_queue_delay_microseconds) and saw almost no improvement.
- Even when many more requests come in, I see no improvement from dynamic batching. According to the log, Triton always infers with a single batch, i.e. it never gathers multiple requests into one batch and just runs them sequentially (see the concurrency sketch after this list). GPU utilization also stays very low the whole time, around 4%, even with 32 model instances.
- When I changed the input shape in the request data to a batch of 32, the throughput was 20 infer/sec, i.e. 640 inferences/sec in total.
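For reference, perf_analyzer drives the server with several concurrent in-flight requests (--concurrency-range), while each JMeter thread sends one synchronous request at a time. A rough Python sketch of that kind of concurrent load (hypothetical script, assuming the tritonclient package and the same triton:8000 endpoint) looks like this:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
import tritonclient.http as client

def infer_once(_):
    # One synchronous request; many of these run in parallel threads,
    # which is what lets dynamic batching see multiple queued requests.
    c = client.InferenceServerClient(url="triton:8000", verbose=False)
    input_ids = np.ones((1, 30), dtype=np.int32)
    token_type_ids = np.zeros_like(input_ids)
    attention_mask = np.ones_like(input_ids)
    inputs = []
    for name, arr in [("input_ids", input_ids),
                      ("token_type_ids", token_type_ids),
                      ("attention_mask", attention_mask)]:
        inp = client.InferInput(name, arr.shape, "INT32")
        inp.set_data_from_numpy(arr, binary_data=True)
        inputs.append(inp)
    out = client.InferRequestedOutput("last_hidden_state", binary_data=True)
    return c.infer("model", inputs=inputs, outputs=[out])

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(infer_once, range(1000)))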
Environment
TensorRT Version:
GPU Type: RTX3090
Nvidia Driver Version: 535.113.01
CUDA Version: 12.2
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
Refer to this repo: https://github.com/royinx/llm_triton (LLM in Triton: Hugging Face -> PyTorch -> ONNX -> TensorRT -> Triton).
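The ONNX export step of that pipeline is not shown here; a minimal sketch of a dynamic batch/sequence export (hypothetical, the checkpoint name and output naming are assumptions; the actual script lives in the linked repo) might look like:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("junnyu/roformer_chinese_base")  # assumed checkpoint
model.config.return_dict = False
model.eval()

# int32 dummies to match the TYPE_INT32 Triton config; older PyTorch/ONNX
# combinations may require int64 indices instead.
dummy = (
    torch.ones(1, 30, dtype=torch.int32),   # input_ids
    torch.ones(1, 30, dtype=torch.int32),   # attention_mask
    torch.zeros(1, 30, dtype=torch.int32),  # token_type_ids
)
names = ["input_ids", "attention_mask", "token_type_ids"]
dynamic_axes = {n: {0: "batch", 1: "sequence"} for n in names}
dynamic_axes["last_hidden_state"] = {0: "batch", 1: "sequence"}

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=names,
    output_names=["last_hidden_state"],
    dynamic_axes=dynamic_axes,
    opset_version=17,
)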
Convert the ONNX model to a TensorRT engine with dynamic shapes; the command looks like:
trtexec --onnx=$SRC_DIR/model.onnx --saveEngine=$SRC_DIR/model.plan --minShapes=input_ids:1x1,attention_mask:1x1,token_type_ids:1x1 --optShapes=input_ids:8x128,attention_mask:8x128,token_type_ids:8x128 --maxShapes=input_ids:32x128,attention_mask:32x128,token_type_ids:32x128 --memPoolSize=workspace:2048 --fp16
Deploy it in Triton Inference Server with the following config.pbtxt:
name: "model"
platform: "tensorrt_plan"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 50
}
default_model_filename: "model.plan"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, 768 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 16
    gpus: [ 0 ]
  }
]
version_policy { all { } }
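To double-check what Triton actually loaded, the model configuration and the per-model statistics (which show the batch sizes actually executed) can be queried through the HTTP client. This is just a verification sketch, not part of the deployment:

import json
import tritonclient.http as client

c = client.InferenceServerClient(url="triton:8000")

# Confirm the dynamic_batching and instance_group settings Triton is using.
print(json.dumps(c.get_model_config("model"), indent=2))

# Execution statistics per model, including counts of batched executions.
print(json.dumps(c.get_inference_statistics(model_name="model"), indent=2))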
My request script looks like the following:
import numpy as np
import tritonclient.http as client
import uvicorn
from fastapi import Body, FastAPI, HTTPException

app = FastAPI()

@app.post('/generate2')
async def generate2(text: str = Body(None, title='text', max_length=1000)):
    triton_client = client.InferenceServerClient(url='triton:8000', verbose=0)
    input_ids = np.asarray([[101, 107, 232, 309, 241, 2351, 4334, 6871, 5931, 819, 5967, 629, 4977, 2755, 3418, 421, 4179, 7679, 5155, 207, 3385, 3418, 4461, 465, 2578, 6724, 2351, 7679, 107, 10]], np.int32)
    # input_ids = np.tile(input_ids, (32, 1))  # used for the 32-batch test
    token_type_ids = np.zeros_like(input_ids)
    attention_mask = np.ones_like(input_ids)
    model_name = 'model'
    input_names = ['input_ids', 'token_type_ids', 'attention_mask']
    output_name = 'last_hidden_state'
    input0 = client.InferInput(input_names[0], input_ids.shape, 'INT32')
    input0.set_data_from_numpy(input_ids, binary_data=True)
    input1 = client.InferInput(input_names[1], token_type_ids.shape, 'INT32')
    input1.set_data_from_numpy(token_type_ids, binary_data=True)
    input2 = client.InferInput(input_names[2], attention_mask.shape, 'INT32')
    input2.set_data_from_numpy(attention_mask, binary_data=True)
    output = client.InferRequestedOutput(output_name, binary_data=True)
    response = triton_client.infer(model_name, inputs=[input0, input1, input2], outputs=[output])
    ans_numpy = response.as_numpy(output_name)
    result = ans_numpy.tolist()
    if not result:
        raise HTTPException(status_code=400, detail='Inference result is empty')
    return result

if __name__ == '__main__':
    uvicorn.run(app=app, host='0.0.0.0', port=9110, log_level='info')
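For completeness, the JMeter test posts to this endpoint; a hand-rolled equivalent request (hypothetical, using the requests library and assuming the app is reachable on localhost:9110) would be:

import requests

# The handler currently ignores the posted text and runs a fixed token sequence;
# the text, if any, is sent as a plain JSON string body.
resp = requests.post('http://localhost:9110/generate2', json='some input text')
print(resp.status_code, len(resp.json()))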