Performance Issues with an LLM on NVIDIA Jetson Orin NX (16GB)

Dear NVIDIA Support Team,
I am currently using an NVIDIA Jetson Orin NX (16GB) device to run the LLM llama-2-7b-chat.Q4_K_S.gguf in text-generation-webui.
However, I am experiencing extremely slow text generation, at a rate of roughly 2 words per second.
I have attached a screenshot illustrating the model loading and the generated result for your reference.
Could you please advise if there are any specific drivers, software optimizations, or other actions I could take to enhance the performance? I am particularly interested in any updates or configurations that could better leverage the capabilities of the Jetson Orin NX for this task.

Hi @kuanmingchen, it appears you have set the n_gpu_layers setting for the llama_cpp loader in text-generation-webui, so it should be using the GPU. If you run multiple requests, is it just the first one that is that slow? The first run of a model often takes much longer, as the CUDA kernels and memory are loaded onto the device (so there is typically a ‘warmup’ run of the model while the application is initializing).
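
One quick way to check whether warmup is the culprit is to time several generations in a row against the same loaded model. Below is a minimal sketch using the llama-cpp-python bindings directly (the model path, prompt, and parameters are placeholder assumptions for your setup, not taken from your screenshot); if warmup is the issue, the first call should be noticeably slower than the later ones.

```python
# Minimal sketch: time repeated generations to separate the one-time
# warmup cost (CUDA kernel/memory initialization) from steady-state speed.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_S.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,
    verbose=False,
)

for i in range(3):
    # Vary the prompt slightly so prompt caching cannot skew the timing.
    prompt = f"({i}) Explain in one sentence what a Jetson Orin NX is."
    start = time.time()
    out = llm(prompt, max_tokens=64)
    elapsed = time.time() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"run {i}: {n_tokens} tokens in {elapsed:.1f}s "
          f"({n_tokens / elapsed:.1f} tok/s)")
```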

Also, while this performance is below expectations even for llama.cpp, llama.cpp is not the most optimized LLM backend, and you would ultimately get higher performance from MLC, AWQ, or exllama (if you look around jetson-containers, there are builds available for these as well).
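
If you do try several backends, it helps to measure tokens/sec the same way for each. Here is a small, backend-agnostic timing sketch (the `tokens_per_second` helper and the `generate` callable are hypothetical names, not part of any of these libraries); it reports the best rate over several runs so the slow warmup run does not skew the comparison.

```python
# Hypothetical, backend-agnostic throughput helper: wrap any backend's
# generation call in a `generate(prompt) -> completion token count`
# function and compare tokens/sec on the same prompt across loaders.
import time
from typing import Callable

def tokens_per_second(generate: Callable[[str], int],
                      prompt: str, runs: int = 3) -> float:
    """Best tokens/sec over several runs, so the warmup run is ignored."""
    best = 0.0
    for _ in range(runs):
        start = time.time()
        n_tokens = generate(prompt)  # backend does the actual generation
        best = max(best, n_tokens / (time.time() - start))
    return best

# Example adapter for the llama-cpp-python loader from the sketch above:
# rate = tokens_per_second(
#     lambda p: llm(p, max_tokens=64)["usage"]["completion_tokens"],
#     "Summarize the Jetson Orin NX in one sentence.",
# )
# print(f"{rate:.1f} tokens/sec")
```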
