Performance Issues with an LLM on NVIDIA Jetson Orin NX (16GB)

Dear NVIDIA Support Team,
I am currently using an NVIDIA Jetson Orin NX (16GB) device to run the LLM llama-2-7b-chat.Q4_K_S.gguf in text-generation-webui.
However, I am experiencing extremely slow text generation, at a rate of roughly 2 words per second.
I have attached a screenshot illustrating the model loading and the generated result for your reference.
Could you please advise if there are any specific drivers, software optimizations, or other actions I could take to enhance the performance? I am particularly interested in any updates or configurations that could better leverage the capabilities of the Jetson Orin NX for this task.

Hi @kuanmingchen, it appears you have set the n_gpu_layers setting for the llama_cpp loader in text-generation-webui, so it should be using the GPU. If you run multiple requests, is it just the first one that is that slow? The first run of a model often takes much longer, as the CUDA kernels and memory are loaded onto the device (so there is typically a ‘warmup’ run of the model while the application is initializing).
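
One quick way to check whether warmup is the culprit is to time several generations in a row against the same loaded model. Below is a minimal sketch using the llama-cpp-python bindings directly (the model path, prompt, and parameters are placeholder assumptions for your setup, not taken from your screenshot); if warmup is the issue, the first call should be noticeably slower than the later ones.

```python
# Minimal sketch: time repeated generations to separate the one-time
# warmup cost (CUDA kernel/memory initialization) from steady-state speed.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_S.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,
    verbose=False,
)

for i in range(3):
    # Vary the prompt slightly so prompt caching cannot skew the timing.
    prompt = f"({i}) Explain in one sentence what a Jetson Orin NX is."
    start = time.time()
    out = llm(prompt, max_tokens=64)
    elapsed = time.time() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"run {i}: {n_tokens} tokens in {elapsed:.1f}s "
          f"({n_tokens / elapsed:.1f} tok/s)")
```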

Also, while this performance is below expectations even for llama.cpp, llama.cpp is not the most optimized LLM backend, and you would ultimately get higher performance from MLC, AWQ, or exllama (if you look around jetson-containers, there are builds available for these as well).
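
If you do try several backends, it helps to measure tokens/sec the same way for each. Here is a small, backend-agnostic timing sketch (the `tokens_per_second` helper and the `generate` callable are hypothetical names, not part of any of these libraries); it reports the best rate over several runs so the slow warmup run does not skew the comparison.

```python
# Hypothetical, backend-agnostic throughput helper: wrap any backend's
# generation call in a `generate(prompt) -> completion token count`
# function and compare tokens/sec on the same prompt across loaders.
import time
from typing import Callable

def tokens_per_second(generate: Callable[[str], int],
                      prompt: str, runs: int = 3) -> float:
    """Best tokens/sec over several runs, so the warmup run is ignored."""
    best = 0.0
    for _ in range(runs):
        start = time.time()
        n_tokens = generate(prompt)  # backend does the actual generation
        best = max(best, n_tokens / (time.time() - start))
    return best

# Example adapter for the llama-cpp-python loader from the sketch above:
# rate = tokens_per_second(
#     lambda p: llm(p, max_tokens=64)["usage"]["completion_tokens"],
#     "Summarize the Jetson Orin NX in one sentence.",
# )
# print(f"{rate:.1f} tokens/sec")
```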
