Early this year (2024), I was very satisfied with the performance of llama.cpp and Mistral on a Jetson Xavier AGX.
However, after I built the latest llama.cpp code last week (August 25, 2024) to run Llama 3.1 and Phi 3.5, the model loading time became unbearable: Llama-3.1-8B-Lexi-Uncensored_V2_Q8.gguf (8.5 GB) took 7.5 minutes to load.
Once loaded, the inference speed is fine:
llama_print_timings: load time = 450482.81 ms
llama_print_timings: sample time = 3076.00 ms / 1308 runs ( 2.35 ms per token, 425.23 tokens per second)
llama_print_timings: prompt eval time = 14287.63 ms / 196 tokens ( 72.90 ms per token, 13.72 tokens per second)
llama_print_timings: eval time = 154754.42 ms / 1300 runs ( 119.04 ms per token, 8.40 tokens per second)
llama_print_timings: total time = 770548.21 ms / 1496 tokens
Has anyone had a similar experience? Not many people in the llama.cpp community use Jetson, so I think this may be the proper forum to ask.
Could you share the link to the llama.cpp version you are using so we can check it?
A possible cause is that JIT compilation is required when launching the app.
To avoid JIT, please check whether you have compiled the app for the Xavier GPU architecture (sm_72).
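As a quick sanity check (just a sketch, assuming PyTorch and a recent llama-cpp-python build are installed; the llama_supports_gpu_offload() binding may not exist in older wheels), you can confirm the GPU's compute capability and whether GPU offload was compiled in:

```python
# Quick sanity check (a sketch, not part of llama.cpp): report the Jetson's
# compute capability and whether the installed llama-cpp-python wheel was
# built with GPU offload support.
import torch
import llama_cpp

major, minor = torch.cuda.get_device_capability(0)  # Xavier AGX should report (7, 2), i.e. sm_72
print(f"GPU compute capability: sm_{major}{minor}")

# llama_supports_gpu_offload() is bound by recent llama-cpp-python releases;
# False means the wheel was compiled CPU-only.
print("GPU offload compiled in:", llama_cpp.llama_supports_gpu_offload())
```

On the build side, the target architecture is normally selected at compile time via the standard CMAKE_CUDA_ARCHITECTURES CMake variable (72 for Xavier). If the binary only carries PTX for a different architecture, the CUDA driver JIT-compiles the kernels on first launch, which can take several minutes on Jetson and would match the load times described above.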
I have the same problem here. I tried both CPU-only and GPU builds: if I compile llama-cpp-python with CPU only, the model loads very fast (about 2 seconds), while inference is slow, as expected on the CPU. But when I build with GPU support, loading the model is slow, as described by generative.cloud, even with offloading disabled (0 layers offloaded). I think the problem appears as soon as the GPU is involved: I tried .to(device) in PyTorch and the behavior is the same, so I suppose the GPU offload itself is the problem. In theory, though, that should not be an issue on Xavier, since it is an iGPU that shares memory with the CPU.
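For reference, this is roughly how I am timing the load step with llama-cpp-python (just a sketch; the model path is a placeholder, and n_gpu_layers=-1 means offload all layers):

```python
# Rough timing sketch (model path is a placeholder) comparing load time
# with no offload vs. full offload in llama-cpp-python.
import time
from llama_cpp import Llama

MODEL = "/path/to/Llama-3.1-8B-Lexi-Uncensored_V2_Q8.gguf"

for n_gpu_layers in (0, -1):  # 0 = keep everything on the CPU, -1 = offload all layers
    t0 = time.time()
    llm = Llama(model_path=MODEL, n_gpu_layers=n_gpu_layers, verbose=False)
    print(f"n_gpu_layers={n_gpu_layers}: load took {time.time() - t0:.1f} s")
    del llm  # free the model before the next iteration
```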
Thanks in advance for any answers.