Seeking Advice on Running Quantized Large Language Models on Jetson AGX Xavier

Hello NVIDIA Community,

I am currently working on a project to deploy large language models (LLMs) on a Jetson AGX Xavier, with a focus on using the device’s GPU for inference. My goal is to run quantized LLMs, particularly those available from Hugging Face, locally on the Xavier and to monitor GPU utilization with tools such as jtop.

Initially, I experimented with Ollama, which ran on the Xavier but did not utilize the GPU as expected, leading to suboptimal performance. I then explored Oobabooga, hoping it might integrate better with the Xavier’s GPU, but I ran into build failures both natively and via Docker.
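For anyone wanting to reproduce the GPU-utilization check described above: while a model is serving a request, the GR3D_FREQ field in tegrastats output (the same counter jtop displays) shows the GPU load. Below is a minimal sketch of parsing that field from a captured tegrastats line; the sample line and the helper name are illustrative, not from any official tool.

```python
import re
from typing import Optional

def gpu_load_percent(line: str) -> Optional[int]:
    """Extract the GPU load from a tegrastats output line.

    tegrastats ships with JetPack and reports the GPU as e.g.
    "GR3D_FREQ 57%@1377" (load percent, then clock in MHz).
    Returns None if the field is absent.
    """
    m = re.search(r"GR3D_FREQ (\d+)%", line)
    return int(m.group(1)) if m else None

# Illustrative tegrastats line (abridged); a sustained 0% here while the
# model generates tokens would confirm inference is running on the CPU.
sample = "RAM 4722/31927MB (lfb 6x4MB) GR3D_FREQ 57%@1377 AO@38C GPU@39C"
print(gpu_load_percent(sample))  # 57
```

In practice one would pipe `tegrastats` (or watch jtop’s GPU gauge) in a second terminal while issuing a prompt to the model, and look for the load climbing above zero.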

I also attempted to use TensorRT-LLM to optimize LLMs for the Xavier’s GPU, but ran into compatibility and build issues there as well.

Given these challenges, I am reaching out to the NVIDIA community for guidance and suggestions:

  1. Are there any recommended approaches or best practices for running quantized LLMs on Jetson AGX Xavier, particularly to ensure GPU utilization?
  2. Has anyone successfully deployed quantized versions of Hugging Face models on Xavier and can share insights or potential solutions for the issues encountered?
  3. Are there specific configurations, tools, or libraries that I might be missing, which could facilitate the deployment and efficient running of LLMs on this device?

I am keen to explore every avenue to make this project a success and would greatly appreciate any advice, tips, or shared experiences that could steer me in the right direction. Time is short, so any help in expediting this would be immensely valuable.

Thank you in advance for your support and looking forward to your suggestions!

Best regards,

Hi,

We have a similar topic, and the discussion there might also help with your use case.
Please give it a check:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.