OpenAI Compatible API does not work

/v1/completions works normally, but /v1/chat/completions results in 500 errors.
I need /v1/chat/completions for my function-calling test.
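
For reference, a minimal function-calling request along these lines fails with the 500 below (a sketch, not my exact test code; the base URL, model name, and tool definition are assumptions):

```python
# Minimal repro sketch: the base_url, model name, and tool schema below
# are placeholders; adjust them to match your NIM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# This call returns the InternalServerError shown below.
response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)
```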

InternalServerError: Error code: 500 - {'object': 'error', 'message': "__init__(): incompatible constructor arguments. The following argument types are supported:\n 1. tensorrt_llm.bindings.executor.Request(input_token_ids: list[int], max_new_tokens: int, streaming: bool = False, sampling_config: tensorrt_llm.bindings.executor.SamplingConfig = SamplingConfig(), output_config: tensorrt_llm.bindings.executor.OutputConfig = OutputConfig(), end_id: Optional[int] = None, pad_id: Optional[int] = None, bad_words: Optional[list[list[int]]] = None, stop_words: Optional[list[list[int]]] = None, embedding_bias: Optional[torch.Tensor] = None, external_draft_tokens_config: Optional[tensorrt_llm.bindings.executor.ExternalDraftTokensConfig] = None, prompt_tuning_config:

Can you confirm which NIM you are using in this example?

Thanks!

NIM_MODEL_PROFILE: "tensorrt_llm-h100-bf16-tp1-throughput"
Images: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest or nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

I just followed the Function Calling guide on the NVIDIA Docs Hub (Function Calling - NVIDIA Docs).

Hi @soonh.yoon – due to a bug, the max_tokens parameter is required for completion and chat completion API calls with the latest Llama 3.1 models. We'll address this in a future release, but for now please ensure that the max_tokens parameter is set.
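
For example, something like this should work (a sketch; the base URL and model name are placeholders for your deployment):

```python
# Workaround sketch: always set max_tokens explicitly on chat completion
# calls. The base_url and model name are assumptions; match them to your
# own NIM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    max_tokens=1024,  # required for now due to the bug described above
)
print(response.choices[0].message.content)
```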

Could the problem be that the OpenAI API doesn't have a parameter called max_new_tokens?
InternalServerError: Error code: 500 - {'object': 'error', 'message': "__init__(): incompatible constructor arguments. The following argument types are supported:\n 1. tensorrt_llm.bindings.executor.Request(input_token_ids: list[int], max_new_tokens: int, streaming: bool = False, sampling_config: tensorrt_llm.bindings.executor.SamplingConfig = SamplingConfig(), output_config: tensorrt_llm.bindings.executor.OutputConfig = OutputConfig(), end_id: Optional[int] = None, pad_id: Optional[int] = None, bad_words: Optional[list[list[int]]] = None, stop_words: Optional[list[list[int]]] = None, embedding_bias: Optional[torch.Tensor] = None, external_draft_tokens_config: … , max_new_tokens=None, streaming=True, output_config=<tensorrt_llm.bindings.executor.OutputConfig object at 0x7f41bc987670>, sampling_config=<tensorrt_llm.bindings.executor.SamplingConfig object at 0x7f45d5b32370>, end_id=128009, lora_config=None, logits_post_processor_name='batched'", 'type': 'InternalServerError', 'param': None, 'code': 500}

After modifying the OpenAI API source code to pass max_new_tokens, I get: BadRequestError: Error code: 400 - {'object': 'error', 'message': "{'type': 'extra_forbidden', 'loc': ('body', 'max_new_tokens'), 'msg': 'no additional input allowed', 'input': 1024}", 'type': 'BadRequestError', 'param': None, 'code': 400}

Hi @soonh.yoon – in the HTTP API, this parameter is called max_tokens.
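
At the HTTP level the request body would look something like this (a sketch using requests; the URL and model name are assumptions):

```python
# Sketch of the same request at the HTTP level: the body field is
# max_tokens, not max_new_tokens (the latter only exists internally in
# the TensorRT-LLM executor API). URL and model name are assumptions.
import requests

payload = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "max_tokens": 1024,
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=60
)
print(resp.status_code)
print(resp.json())
```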