Description
I am currently running a Llama 3 8B Instruct model on a Triton Inference Server, with the engine built by TensorRT-LLM. When I submit 13 prompts, everything works fine, but when I submit a 14th, I get the following error:
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::validateInputBindings::1753] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::validateInputBindings::1753, condition: profileMinDims.d[i] <= dimensions.d[i] Supplied binding dimension [76] for bindings[30] exceed min ~ max range at index 0, maximum dimension in profile is 256, minimum dimension in profile is 128, but supplied dimension is 76.)
Environment
TensorRT-LLM Version: v0.10.0
GPU Type: 1x H100
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4
Operating System + Version: Ubuntu 22.04
Relevant Files
N/A
Steps To Reproduce
Follow this article to build the model engine, but pass a checkpoint directory to trtllm-build. Then, configure the models similarly to this blog. After that, host the models on a Triton Inference Server using the Docker image nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 and send 14 HTTP requests with the following data:
"text_input" : f"<|start_header_id|>user<|end_header_id|>Tell me about the number {id}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
"parameters" : {
"max_tokens" : 128,
"temperature" : 0.5
}
where id represents which request you're sending (i.e., 0, 1, 2, …, 13).
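A minimal client sketch along these lines reproduces the calls (the ensemble model name and port are assumptions based on the blog's default configuration, so adjust the URL to match your deployment):

```python
# Minimal client sketch; the "ensemble" model name and port 8000 are assumed
# from the blog's default Triton model repository and HTTP generate endpoint.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"


def send(i: int) -> str:
    payload = {
        "text_input": (
            "<|start_header_id|>user<|end_header_id|>"
            f"Tell me about the number {i}"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
        ),
        "parameters": {"max_tokens": 128, "temperature": 0.5},
    }
    response = requests.post(URL, json=payload)
    return f"{i}: {response.status_code} {response.text[:200]}"


# Requests are sent concurrently so the server can batch them; the failure
# shows up once the 14th request is in flight.
with ThreadPoolExecutor(max_workers=14) as pool:
    for result in pool.map(send, range(14)):
        print(result)
```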
I believe that this has something to do with the number of tokens involved, as using different prompts and setting max_tokens to 1024 resulted in this error occurring after only 11 prompts.
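For what it's worth, the per-prompt token counts behind this hypothesis can be checked with something like the following (a sketch, assuming the Hugging Face meta-llama/Meta-Llama-3-8B-Instruct tokenizer is available locally):

```python
# Sketch for checking how many tokens each prompt contributes; assumes the
# meta-llama/Meta-Llama-3-8B-Instruct tokenizer can be loaded locally.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = (
    "<|start_header_id|>user<|end_header_id|>"
    "Tell me about the number 0"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)
print(len(tokenizer.encode(prompt)))
```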