NIM - Llama 3 8B Instruct - Results were very weird

Hi all,

I tried the on-premises deployment of llama3-8b-instruct, following the deployment steps from the Docker section of the docs.
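For context, I launched it with something close to the standard command from the getting-started guide (the image tag, cache path, and API key variable below are illustrative and may differ slightly from what I actually ran):

```
# Rough sketch of the documented NIM launch command; tag and cache
# path are illustrative, not a verbatim copy of my exact invocation.
export NGC_API_KEY=<your NGC API key>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```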

First of all, I observed that it took a lot of GPU memory (~36 GB), not the ~24 GB the docs mention.

Please check here

Secondly, after the deployment succeeded, I tested the model with the command suggested on that page (roughly the request sketched below).
But the results were incomprehensible, essentially random tokens.
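For reference, the test request was along these lines (the host, port, and prompt are illustrative; `max_tokens: 64` is inferred from the `completion_tokens` and `finish_reason` in the response below):

```
# Rough sketch of the documented NIM test request against the
# OpenAI-compatible endpoint; prompt and host are illustrative.
curl -X POST 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Write a limerick about GPU computing."}],
    "max_tokens": 64
  }'
```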

{"id":"cmpl-09ce782c884f40bebb696bcdbb333eb3",
"object":"chat.completion",
"created":1718767345,
"model":"meta/llama3-8b-instruct",
"choices":[{"index":0,"message":{"role":"assistant",
"content":"Cloud bathroom Of downloaded Return to You(eny Name-------------QAJa Lifetime Caught0 HmmUCHleg Do You Had Number Daughter Onlycccc Even${Sat.\r\n\r\n ThatTag)&SaCaughtLCForgery/Hex YourFolder#{ Sonra InsidelanguagesP Love Very MajorityDiscoverHelpArm And Herencmont Alone Q_Base(Pbab Tight WhoAre"},
"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":22,"total_tokens":86,"completion_tokens":64}}%

Here is a screenshot of the service log.

Spec

  • GPU instance: A100 40GB
  • CPU cores: 128
  • RAM: 256GB

Does anyone have any idea what might be causing this?
Thank you so much.

BR,
Chieh

Just speculation: TensorRT-LLM's default KV cache setting allocates 90% of the remaining GPU memory, which would explain the high memory usage.
https://nvidia.github.io/TensorRT-LLM/reference/memory.html#id1
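Back-of-envelope, assuming FP16 weights: 8B params × 2 bytes ≈ 16 GB for the weights, leaving roughly 24 GB free on a 40 GB A100; 90% of that is about 21.6 GB of KV cache, for roughly 37-38 GB total, which is close to the ~36 GB you observed.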

You may need to cap the KV cache / max model length, similar to the override in vLLM's Python API (sketched below for comparison), but I'm not sure how to set it with TensorRT-LLM.
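For comparison only, this is roughly what that override looks like on the vLLM side, whether via the Python `LLM(...)` constructor arguments (`gpu_memory_utilization`, `max_model_len`) or the equivalent server flags; the values are just examples and this is not how the NIM/TensorRT-LLM container is configured:

```
# Illustrative vLLM launch showing the analogous memory/length caps;
# values are examples only, not a NIM or TensorRT-LLM setting.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.5 \
  --max-model-len 8192
```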