Early this year (2024), I was very satisfied with the performance of llama.cpp and Mistral on a Jetson Xavier AGX.
However, after I built the latest llama.cpp code last week (August 25, 2024) to run Llama 3.1 and Phi 3.5, the model loading time became unbearable: Llama-3.1-8B-Lexi-Uncensored_V2_Q8.gguf (8.5 GB) took 7.5 minutes to load.
Once loaded, the inference speed is fine:
llama_print_timings: load time = 450482.81 ms
llama_print_timings: sample time = 3076.00 ms / 1308 runs ( 2.35 ms per token, 425.23 tokens per second)
llama_print_timings: prompt eval time = 14287.63 ms / 196 tokens ( 72.90 ms per token, 13.72 tokens per second)
llama_print_timings: eval time = 154754.42 ms / 1300 runs ( 119.04 ms per token, 8.40 tokens per second)
llama_print_timings: total time = 770548.21 ms / 1496 tokens
Has anyone had a similar experience? Not many people in the llama.cpp community use Jetson, so I think this may be the proper forum to ask.
Could you share the link to the llama.cpp version you are using so we can check it?
A possible cause is that JIT compilation is required when launching the app.
To avoid JIT, please check whether you have compiled the app for the Xavier GPU architecture (sm_72).
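As a quick sanity check (just a sketch, assuming PyTorch and a recent llama-cpp-python build are installed; the llama_supports_gpu_offload() binding may not exist in older wheels), you can confirm the GPU's compute capability and whether GPU offload was compiled in:

```python
# Quick sanity check (a sketch, not part of llama.cpp): report the Jetson's
# compute capability and whether the installed llama-cpp-python wheel was
# built with GPU offload support.
import torch
import llama_cpp

major, minor = torch.cuda.get_device_capability(0)  # Xavier AGX should report (7, 2), i.e. sm_72
print(f"GPU compute capability: sm_{major}{minor}")

# llama_supports_gpu_offload() is bound by recent llama-cpp-python releases;
# False means the wheel was compiled CPU-only.
print("GPU offload compiled in:", llama_cpp.llama_supports_gpu_offload())
```

On the build side, the target architecture is normally selected at compile time via the standard CMAKE_CUDA_ARCHITECTURES CMake variable (72 for Xavier). If the binary only carries PTX for a different architecture, the CUDA driver JIT-compiles the kernels on first launch, which can take several minutes on Jetson and would match the load times described above.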
I have the same problem here. I tried both CPU-only and GPU builds: if I compile llama-cpp-python with CPU only, the model loads very fast (about 2 seconds), while inference is slow, as expected on the CPU. But when I build with GPU support, loading the model is slow, as described by generative.cloud, even with offloading disabled (0 layers offloaded). I think the problem appears as soon as the GPU is involved: I tried .to(device) in PyTorch and the behavior is the same, so I suppose the GPU offload itself is the problem. In theory, though, that should not be an issue on Xavier, since it is an iGPU that shares memory with the CPU.
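For reference, this is roughly how I am timing the load step with llama-cpp-python (just a sketch; the model path is a placeholder, and n_gpu_layers=-1 means offload all layers):

```python
# Rough timing sketch (model path is a placeholder) comparing load time
# with no offload vs. full offload in llama-cpp-python.
import time
from llama_cpp import Llama

MODEL = "/path/to/Llama-3.1-8B-Lexi-Uncensored_V2_Q8.gguf"

for n_gpu_layers in (0, -1):  # 0 = keep everything on the CPU, -1 = offload all layers
    t0 = time.time()
    llm = Llama(model_path=MODEL, n_gpu_layers=n_gpu_layers, verbose=False)
    print(f"n_gpu_layers={n_gpu_layers}: load took {time.time() - t0:.1f} s")
    del llm  # free the model before the next iteration
```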
Thanks in advance for any answers.