Seeking Advice on Running Quantized Large Language Models on Jetson AGX Xavier

sarthak8 · March 18, 2024, 9:25am

Hello NVIDIA Community,

I am currently working on a project involving the deployment of large language models (LLMs) on a Jetson AGX Xavier device, with a specific focus on leveraging the device’s GPU capabilities for enhanced performance. My objective is to run quantized LLMs, especially those available from Hugging Face, locally on the Xavier and to monitor performance metrics using tools like jtop.

Initially, I experimented with Ollama, which ran successfully on the Xavier but did not use the GPU as expected, leading to suboptimal performance. I then explored Oobabooga, hoping it might offer better integration with the Xavier’s GPU. Unfortunately, I ran into build issues, and my attempts to build it through Docker also failed.

I also tried TensorRT-LLM to optimize LLMs for the Xavier’s GPU. This approach did not yield the expected outcome either, as I faced compatibility or implementation issues.

Given these challenges, I am reaching out to the NVIDIA community for guidance and suggestions:

Are there any recommended approaches or best practices for running quantized LLMs on Jetson AGX Xavier, particularly to ensure GPU utilization?
Has anyone successfully deployed quantized versions of Hugging Face models on the Xavier, and if so, could you share insights or potential solutions for the issues described above?
Are there specific configurations, tools, or libraries that I might be missing and that could make it easier to deploy and run LLMs efficiently on this device?

I am keen on exploring all possible avenues to make this project a success and would greatly appreciate any advice, tips, or shared experiences that could steer me in the right direction. Time is of the essence, and any help to expedite this process would be immensely valuable.

Thank you in advance for your support. I look forward to your suggestions!

Best regards,

AastaLLL · March 19, 2024, 6:10am

Hi,

We have a similar topic, and the discussion there might also help with your use case.
Please take a look:

Thanks.

system · April 10, 2024, 5:51am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
