Description
Model: RoFormer
Backend: TensorRT with dynamic batch size and dynamic input length
Problem:
- Throughput measured with JMeter (125 infer/sec) is far lower than with perf_analyzer (689 infer/sec) for batch size 1.
- I tried varying several parameters (model instances, max_batch_size, preferred_batch_size, max_queue_delay_microseconds) and saw almost no improvement.
- Even when many more requests come in, I see no improvement from dynamic batching. According to the log, Triton always infers with a single batch, i.e. it never gathers multiple requests into one batch and just runs them sequentially (see the concurrency sketch after this list). GPU utilization also stays very low the whole time, around 4%, even with 32 model instances.
- When I changed the input shape in the request data to a batch of 32, the throughput was 20 infer/sec, i.e. 640 inferences/sec in total.
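For reference, perf_analyzer drives the server with several concurrent in-flight requests (--concurrency-range), while each JMeter thread sends one synchronous request at a time. A rough Python sketch of that kind of concurrent load (hypothetical script, assuming the tritonclient package and the same triton:8000 endpoint) looks like this:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
import tritonclient.http as client

def infer_once(_):
    # One synchronous request; many of these run in parallel threads,
    # which is what lets dynamic batching see multiple queued requests.
    c = client.InferenceServerClient(url="triton:8000", verbose=False)
    input_ids = np.ones((1, 30), dtype=np.int32)
    token_type_ids = np.zeros_like(input_ids)
    attention_mask = np.ones_like(input_ids)
    inputs = []
    for name, arr in [("input_ids", input_ids),
                      ("token_type_ids", token_type_ids),
                      ("attention_mask", attention_mask)]:
        inp = client.InferInput(name, arr.shape, "INT32")
        inp.set_data_from_numpy(arr, binary_data=True)
        inputs.append(inp)
    out = client.InferRequestedOutput("last_hidden_state", binary_data=True)
    return c.infer("model", inputs=inputs, outputs=[out])

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(infer_once, range(1000)))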
Environment
TensorRT Version:
GPU Type: RTX3090
Nvidia Driver Version: 535.113.01
CUDA Version: 12.2
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
Refer to this repo: https://github.com/royinx/llm_triton (LLM in Triton: Hugging Face -> PyTorch -> ONNX -> TensorRT -> Triton).
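The ONNX export step of that pipeline is not shown here; a minimal sketch of a dynamic batch/sequence export (hypothetical, the checkpoint name and output naming are assumptions; the actual script lives in the linked repo) might look like:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("junnyu/roformer_chinese_base")  # assumed checkpoint
model.config.return_dict = False
model.eval()

# int32 dummies to match the TYPE_INT32 Triton config; older PyTorch/ONNX
# combinations may require int64 indices instead.
dummy = (
    torch.ones(1, 30, dtype=torch.int32),   # input_ids
    torch.ones(1, 30, dtype=torch.int32),   # attention_mask
    torch.zeros(1, 30, dtype=torch.int32),  # token_type_ids
)
names = ["input_ids", "attention_mask", "token_type_ids"]
dynamic_axes = {n: {0: "batch", 1: "sequence"} for n in names}
dynamic_axes["last_hidden_state"] = {0: "batch", 1: "sequence"}

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=names,
    output_names=["last_hidden_state"],
    dynamic_axes=dynamic_axes,
    opset_version=17,
)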
Convert the ONNX model to a TensorRT engine with dynamic shapes; the command looks like:
trtexec --onnx=$SRC_DIR/model.onnx --saveEngine=$SRC_DIR/model.plan --minShapes=input_ids:1x1,attention_mask:1x1,token_type_ids:1x1 --optShapes=input_ids:8x128,attention_mask:8x128,token_type_ids:8x128 --maxShapes=input_ids:32x128,attention_mask:32x128,token_type_ids:32x128 --memPoolSize=workspace:2048 --fp16
Deploy it in Triton Inference Server with the following config.pbtxt:
name: "model"
platform: "tensorrt_plan"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 50
}
default_model_filename: "model.plan"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, 768 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 16
    gpus: [ 0 ]
  }
]
version_policy { all { } }
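To double-check what Triton actually loaded, the model configuration and the per-model statistics (which show the batch sizes actually executed) can be queried through the HTTP client. This is just a verification sketch, not part of the deployment:

import json
import tritonclient.http as client

c = client.InferenceServerClient(url="triton:8000")

# Confirm the dynamic_batching and instance_group settings Triton is using.
print(json.dumps(c.get_model_config("model"), indent=2))

# Execution statistics per model, including counts of batched executions.
print(json.dumps(c.get_inference_statistics(model_name="model"), indent=2))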
My request script looks like the following:
import numpy as np
import tritonclient.http as client
import uvicorn
from fastapi import Body, FastAPI, HTTPException

app = FastAPI()

@app.post('/generate2')
async def generate2(text: str = Body(None, title='text', max_length=1000)):
    triton_client = client.InferenceServerClient(url='triton:8000', verbose=0)
    input_ids = np.asarray([[101, 107, 232, 309, 241, 2351, 4334, 6871, 5931, 819, 5967, 629, 4977, 2755, 3418, 421, 4179, 7679, 5155, 207, 3385, 3418, 4461, 465, 2578, 6724, 2351, 7679, 107, 10]], np.int32)
    # input_ids = np.tile(input_ids, (32, 1))  # used for the 32-batch test
    token_type_ids = np.zeros_like(input_ids)
    attention_mask = np.ones_like(input_ids)
    model_name = 'model'
    input_names = ['input_ids', 'token_type_ids', 'attention_mask']
    output_name = 'last_hidden_state'
    input0 = client.InferInput(input_names[0], input_ids.shape, 'INT32')
    input0.set_data_from_numpy(input_ids, binary_data=True)
    input1 = client.InferInput(input_names[1], token_type_ids.shape, 'INT32')
    input1.set_data_from_numpy(token_type_ids, binary_data=True)
    input2 = client.InferInput(input_names[2], attention_mask.shape, 'INT32')
    input2.set_data_from_numpy(attention_mask, binary_data=True)
    output = client.InferRequestedOutput(output_name, binary_data=True)
    response = triton_client.infer(model_name, inputs=[input0, input1, input2], outputs=[output])
    ans_numpy = response.as_numpy(output_name)
    result = ans_numpy.tolist()
    if not result:
        raise HTTPException(status_code=400, detail='Inference result is empty')
    return result

if __name__ == '__main__':
    uvicorn.run(app=app, host='0.0.0.0', port=9110, log_level='info')
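For completeness, the JMeter test posts to this endpoint; a hand-rolled equivalent request (hypothetical, using the requests library and assuming the app is reachable on localhost:9110) would be:

import requests

# The handler currently ignores the posted text and runs a fixed token sequence;
# the text, if any, is sent as a plain JSON string body.
resp = requests.post('http://localhost:9110/generate2', json='some input text')
print(resp.status_code, len(resp.json()))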