NIM - Llama 3 8B Instruct - Results were very weird

Hi all,

I tried the on-premises deployment of llama3-8b-instruct, following the deployment steps from the Docker section of the docs.
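For context, I launched it with something close to the standard command from the getting-started guide (the image tag, cache path, and API key variable below are illustrative and may differ slightly from what I actually ran):

```
# Rough sketch of the documented NIM launch command; tag and cache
# path are illustrative, not a verbatim copy of my exact invocation.
export NGC_API_KEY=<your NGC API key>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```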

First of all, I observed that it took a lot of GPU memory (~36 GB), not the ~24 GB the docs mention.

Please check here

Secondly, after the deployment succeeded, I tested the model with the command suggested on that page (roughly the request sketched below).
But the results were incomprehensible, essentially random tokens.
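For reference, the test request was along these lines (the host, port, and prompt are illustrative; `max_tokens: 64` is inferred from the `completion_tokens` and `finish_reason` in the response below):

```
# Rough sketch of the documented NIM test request against the
# OpenAI-compatible endpoint; prompt and host are illustrative.
curl -X POST 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Write a limerick about GPU computing."}],
    "max_tokens": 64
  }'
```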

{"id":"cmpl-09ce782c884f40bebb696bcdbb333eb3",
"object":"chat.completion",
"created":1718767345,
"model":"meta/llama3-8b-instruct",
"choices":[{"index":0,"message":{"role":"assistant",
"content":"Cloud bathroom Of downloaded Return to You(eny Name-------------QAJa Lifetime Caught0 HmmUCHleg Do You Had Number Daughter Onlycccc Even${Sat.\r\n\r\n ThatTag)&SaCaughtLCForgery/Hex YourFolder#{ Sonra InsidelanguagesP Love Very MajorityDiscoverHelpArm And Herencmont Alone Q_Base(Pbab Tight WhoAre"},
"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":22,"total_tokens":86,"completion_tokens":64}}%

Here is a screenshot of the service log.

Spec

  • GPU instance: A100 40GB
  • CPU cores: 128
  • RAM: 256GB

Does anyone have any idea what might be causing this?
Thank you so much.

BR,
Chieh

Just speculation: TensorRT-LLM's default KV cache setting allocates 90% of the remaining GPU memory, which would explain the high memory usage.
https://nvidia.github.io/TensorRT-LLM/reference/memory.html#id1
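Back-of-envelope, assuming FP16 weights: 8B params × 2 bytes ≈ 16 GB for the weights, leaving roughly 24 GB free on a 40 GB A100; 90% of that is about 21.6 GB of KV cache, for roughly 37-38 GB total, which is close to the ~36 GB you observed.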

You may need to cap the KV cache / max model length, similar to the override in vLLM's Python API (sketched below for comparison), but I'm not sure how to set it with TensorRT-LLM.
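For comparison only, this is roughly what that override looks like on the vLLM side, whether via the Python `LLM(...)` constructor arguments (`gpu_memory_utilization`, `max_model_len`) or the equivalent server flags; the values are just examples and this is not how the NIM/TensorRT-LLM container is configured:

```
# Illustrative vLLM launch showing the analogous memory/length caps;
# values are examples only, not a NIM or TensorRT-LLM setting.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.5 \
  --max-model-len 8192
```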