Hi.
I’m following this blog post to set up TensorRT-LLM for Llama 3 with Triton Inference Server, and everything is going well so far.
The resulting endpoint for generating responses from the client side is http://my_ip:my_port/v2/models/ensemble/generate.
According to the doc here, the “ensemble” in the API path stands for the model name; however, the model name is “tensorrt-llm” in tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt.
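For reference, this is roughly how I’m calling the endpoint from the client today (my_ip/my_port are placeholders, and the request fields are the ones from the blog post’s curl example). If I understand correctly, the “ensemble” segment in the URL comes from the name field of the ensemble model under all_models/inflight_batcher_llm/ensemble/config.pbtxt rather than from the tensorrt_llm one:

```python
# Rough sketch of my current client call; "my_ip:my_port" is a placeholder.
import requests

TRITON_URL = "http://my_ip:my_port"

def generate(model_name: str, prompt: str, max_tokens: int = 64) -> str:
    # Triton's HTTP generate extension: /v2/models/<model_name>/generate,
    # where <model_name> is the name: field from that model's config.pbtxt.
    resp = requests.post(
        f"{TRITON_URL}/v2/models/{model_name}/generate",
        json={
            "text_input": prompt,
            "max_tokens": max_tokens,
            "bad_words": "",
            "stop_words": "",
        },
    )
    resp.raise_for_status()
    return resp.json()["text_output"]

print(generate("ensemble", "What is machine learning?"))
```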
Question:
If I use TensorRT-LLM to compile another model (e.g., Mistral 7B) and go through the same process to launch Triton Inference Server, how can I differentiate the generated endpoints for these two models without using additional ports?
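To make the question concrete, here is the kind of setup I’m hoping is possible (the names ensemble_llama3 and ensemble_mistral are made up for illustration): both pipelines live in one model repository served on the same port, and each is reached through its own model name in the URL path. Would something like this work if I rename the model directories and their name fields accordingly?

```python
# Hypothetical client calls; "ensemble_llama3" and "ensemble_mistral" are
# invented model names, not anything from the blog post or the backend repo.
import requests

TRITON_URL = "http://my_ip:my_port"  # one port shared by both models

payload = {"text_input": "Hello", "max_tokens": 32, "bad_words": "", "stop_words": ""}

llama_resp = requests.post(f"{TRITON_URL}/v2/models/ensemble_llama3/generate", json=payload)
mistral_resp = requests.post(f"{TRITON_URL}/v2/models/ensemble_mistral/generate", json=payload)

print(llama_resp.json()["text_output"])
print(mistral_resp.json()["text_output"])
```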