Triton Inference Server Binding Dimension Error

Description

I am currently running a Llama 3 8B Instruct model on a Triton Inference Server, with the engine built by TensorRT-LLM. When I submit 13 prompts, everything works fine, but when I submit a 14th, I get the following error:
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::validateInputBindings::1753] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::validateInputBindings::1753, condition: profileMinDims.d[i] <= dimensions.d[i] Supplied binding dimension [76] for bindings[30] exceed min ~ max range at index 0, maximum dimension in profile is 256, minimum dimension in profile is 128, but supplied dimension is 76.)

Environment

TensorRT-LLM Version: v0.10.0
GPU Type: 1x H100
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4
Operating System + Version: Ubuntu 22.04

Relevant Files

N/A

Steps To Reproduce

Follow this article to build the model engine, additionally passing a checkpoint directory to trtllm-build. Then configure the models as described in this blog.
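
For reference, the build command looks roughly like the following; the directory paths and the --max_* values are placeholders rather than the exact values I used:

trtllm-build --checkpoint_dir ./llama3-8b-instruct-ckpt \
    --output_dir ./llama3-8b-instruct-engine \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2048 \
    --max_output_len 1024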

After that, host the models on a Triton Inference Server using the docker image nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3.
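
The serving container is launched along these lines (the model repository path is a placeholder):

docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/model_repository:/models \
    nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 \
    tritonserver --model-repository=/models

Once the server is up, send 14 HTTP requests with the following data: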

"text_input" : f"<|start_header_id|>user<|end_header_id|>Tell me about the number {id}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
"parameters" : {
    "max_tokens" : 128,
    "temperature" : 0.5
}

where id represents which request you’re sending (i.e., 0, 1, 2, …, 13).
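
Concretely, the requests can be sent with a small Python script like the one below. It assumes Triton's HTTP generate endpoint on localhost:8000, a model named ensemble, and that all 14 requests are in flight at the same time; adjust these for your deployment:

import concurrent.futures
import requests

# Model name "ensemble" and the host/port are assumptions.
URL = "http://localhost:8000/v2/models/ensemble/generate"

def send(id):
    payload = {
        "text_input": f"<|start_header_id|>user<|end_header_id|>Tell me about the number {id}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
        "parameters": {
            "max_tokens": 128,
            "temperature": 0.5
        }
    }
    response = requests.post(URL, json=payload)
    return id, response.status_code, response.text[:200]

# Submit all 14 requests concurrently so they can land in the same
# in-flight batch; with only 13 requests the error does not occur.
with concurrent.futures.ThreadPoolExecutor(max_workers=14) as pool:
    for result in pool.map(send, range(14)):
        print(result)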

I believe that this has something to do with the number of tokens involved, as using different prompts and setting max_tokens to 1024 resulted in this error occurring after only 11 prompts.
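
In case it is useful, the optimization profile bounds quoted in the error can be dumped straight from the engine with the TensorRT Python API; a minimal sketch, assuming the TensorRT 10 tensor-based API and a hypothetical engine path:

import tensorrt as trt

ENGINE_PATH = "model_repo/tensorrt_llm/1/rank0.engine"  # hypothetical path

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print the (min, opt, max) shapes of every input tensor under
# optimization profile 0, to see which input carries the 128..256 bound.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        min_shape, opt_shape, max_shape = engine.get_tensor_profile_shape(name, 0)
        print(f"{name}: min={min_shape} opt={opt_shape} max={max_shape}")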