Llama 2 LLMs w/ NVIDIA Jetson and text-generation-webui

llava-llama-2-13b-chat-lightning-gptq through oobabooga: RAM usage went from 14.17GB → 20.39GB (a 6.22GB increase), but that seems low, so take it with a grain of salt. That was after querying it on an image a few times. I think it should run in 16GB, though.
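
For a rough sanity check on whether 16GB is enough, here is a back-of-envelope estimate in Python; the per-component numbers are illustrative assumptions, not measurements:

```python
# Back-of-envelope memory estimate for a 4-bit (GPTQ) quantized 13B model.
# All figures below are illustrative assumptions, not measured values.
params = 13e9
bits_per_weight = 4.25    # ~4-bit weights plus group-wise scales/zero-points
weights_gb = params * bits_per_weight / 8 / 1e9   # ≈ 6.9 GB of weights
kv_cache_gb = 1.5         # assumed KV-cache budget at moderate context length
overhead_gb = 2.0         # assumed CLIP vision encoder + runtime/webui overhead
print(f"~{weights_gb + kv_cache_gb + overhead_gb:.1f} GB total")  # ≈ 10.4 GB
```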

Hi @dusty_nv, the llamaspeak project is cool and fast. Am I correct in saying that it can also run on standalone x86 PCs with an NVIDIA graphics card? This project has a bright future, especially since it can be useful for people with visual impairments.

Here are the two optimizations I performed on my NVIDIA Jetson AGX Orin 64GB Developer Kit:

  • I disabled the graphical interface to free up RAM.
  • I turned on MAXN mode.

See the video below

RAM usage:

Thank you!


Hi @shahizat, glad you got it working! Yes, if you run the Riva container and text-generation-webui on x86, it will run there as well. Someone at NVIDIA has done that too.

I have been circling back to do another round of LLM performance optimizations (including time-to-first-token latency during the prefill stage), and will then continue work on the project.


Wanted to post some recent updates to the thread!

  • AWQ/TinyChat added some optimizations, and is now ~40% faster than llama.cpp

  • I got MLC/TVM working, and it is ~60% faster than llama.cpp, with low prefill latency.

  • We released an interactive Llava-Llama-2-chat-GPTQ tutorial (uses oobabooga)

  • I have an even faster llava-llama-2-chat integration with MLC and a realtime CLIP encoder that I'm packaging up

  • Realtime NanoSAM! https://github.com/NVIDIA-AI-IOT/nanosam

  • I've been working on a local_llm mini-package which supports AWQ/MLC (not in oobabooga) and works at the embedding level with unstructured chat for flexible multimodality

  • I've started work on a lightweight multimodal vectorDB that can handle high-dimensional embeddings like images/audio/etc. in addition to chat, using FAISS/RAFT and CUDA underneath for similarity search, with efficient memory management and zero-copy (see the sketch below).
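
To illustrate the kind of similarity search the vectorDB item above is built on, here is a minimal FAISS sketch (a CPU index for simplicity, whereas the post describes a CUDA-backed implementation; the dimensions and data are made up):

```python
# Minimal embedding similarity search with FAISS.
import numpy as np
import faiss

dim = 768                                # e.g. a CLIP image/text embedding size
index = faiss.IndexFlatIP(dim)           # inner product == cosine on normalized vectors

embeddings = np.random.rand(10_000, dim).astype(np.float32)
faiss.normalize_L2(embeddings)
index.add(embeddings)

query = np.random.rand(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # top-5 nearest embeddings
print(ids[0], scores[0])
```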


Hello, @dusty_nv. Thank you for your updates. I have just tested your 13B llava-llama-2 model example, and it is working very well. The results are impressive and provide a comprehensive description of the image.

The RAM was almost fully occupied while running the 13B-parameter llava-llama-2 model.

Dustin, are you aware of the reason why NVMe SSDs on Jetson Orin devices cannot achieve their full speed? The declared speed is up to 7GB/s, and I have verified that all firmware is up to date. I know that attempting to offload RAM with 204.8GB/s of bandwidth to an NVMe SSD rated at 3-7GB/s may not be reasonable at all. I observed that Jim from JetsonHacks used a different NVMe SSD and achieved the same result as I did. Is it somehow related to a Jetson architecture limitation?

Here are the benchmark results of my NVMe SSD.

@shahizat glad you got the quantized llava-2 example working! I would post a new topic about your NVMe performance so our hardware guys can take a look - I just use Samsung 970s, which I don't believe have the same level of performance. You might want to check lspci to see what PCIe gen it's linked at.
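
For reference, here is a hedged sketch of reading the negotiated PCIe link speed/width from sysfs, as an alternative to digging through `lspci -vv` output (the NVMe class-code filter is an assumption):

```python
# Print the current vs. maximum PCIe link for NVMe controllers using standard
# Linux sysfs attributes.
from pathlib import Path

for dev in Path("/sys/bus/pci/devices").iterdir():
    pci_class = (dev / "class").read_text().strip()
    if pci_class.startswith("0x0108"):  # mass storage, non-volatile memory (NVMe)
        speed = (dev / "current_link_speed").read_text().strip()
        width = (dev / "current_link_width").read_text().strip()
        max_speed = (dev / "max_link_speed").read_text().strip()
        max_width = (dev / "max_link_width").read_text().strip()
        print(f"{dev.name}: {speed} x{width} (max {max_speed} x{max_width})")
```

If the link comes up at Gen3 x4 rather than Gen4 x4, that alone caps throughput at roughly 4GB/s, well below the drive's rated 7GB/s.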

Hi @dusty_nv, I am using your docker image on an AGX Xavier 16GB. I get 3 t/s max with a 13B model with the webui. My goal would be to reach 7 t/s. Is that possible? Thanks!

Hi @dusty_nv, I’ve been experimenting with the current python interface to llama.cpp on a 16GB Xavier AGX and I’m impressed with the results.

I hit one or two minor problems getting it going; I couldn't build the code using make, and the default cmake was too old, but installing the latest cmake from source fixed the build.

It runs llama-2-13b-chat.ggmlv3.q6_K.gguf well, giving me about 5 tokens/sec. If only I had an Orin :)

I've also run codellama-7b-instruct.Q6_K.gguf with a 7k context length: that's enough to tackle some serious coding.
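
For anyone else trying this, here is a minimal llama-cpp-python sketch along the lines of the setup described above; the model path and parameters are assumptions, so adjust n_gpu_layers and n_ctx to fit your board's memory:

```python
# Load a GGUF model with llama-cpp-python and run a single completion.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.ggmlv3.q6_K.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU (requires a CUDA build)
    n_ctx=4096,        # context length
)

out = llm("Q: What is the Jetson AGX Xavier? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```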


Hi @robert.semmler1000, I'm not sure, but which loader and 13B model (quantized/etc.) are you using?

Hi @romilly.cocking, here are the steps I use to build llama_cpp_python (from Dockerfile): https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/llama_cpp/Dockerfile

That's good to hear! I've not yet tried codellama, although I did get llama-2 to write code for calling plugins that I defined for it in the system prompt (interspersed with chat). If you find a good way of using it, let us know!

TheBloke/Llama-2-13B-chat-GPTQ · Hugging Face (just put it into the download text field) with ExLlamaHF. It reports an urllib and Python version problem for ExLlamaHF, but it works.

Hi @shahizat and @dusty_nv, can you please share a link to the llamaspeak project shown in the video?
I can't find a repository for it, and it looks really awesome!

Hi @jschw, here is the link: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llamaspeak


@robert.semmler1000 just FYI, I get ~40% better performance from llama.cpp with GGML/GGUF models than from exllama with GPTQ models.

Thank you! I use this Docker image: dustynv/llama_cpp:ggml-r35.4.1. Do you use it in the terminal only, or with a UI? Never mind, I found it: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llama_cpp

@robert.semmler1000 yes, that one is just the core llama.cpp / llama_cpp_python packages, typically used over the CLI (or with another app you write using those libraries). Or you can use the text-generation-webui container, which includes the llama_cpp container, and use the llama_cpp loader in oobabooga.

I keep llama_cpp:ggml for people still using GGML models, but they are easy to convert to GGUF (and TheBloke has lots of GGUF models on the Hugging Face Hub already). llama_cpp:gguf tracks the upstream repos and is what the text-generation-webui container uses to build.
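
If you'd rather skip conversion entirely, here is a sketch of pulling one of TheBloke's prebuilt GGUF quantizations with huggingface_hub; the repo and filename follow his usual naming scheme and may need adjusting:

```python
# Download a prebuilt GGUF quantization from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",
    filename="llama-2-13b-chat.Q6_K.gguf",
)
print("downloaded to", path)
```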

Hi @dusty_nv,
I would like to thank you for the cool updates. I'm particularly interested in deploying Llava on a Jetson board. I'm planning to order a Jetson Orin NX 16GB to carry out the projects, but I've seen that you recommend an AGX 32GB as the minimum requirement in your tutorial. Is there any chance for a quantized Llava to run on the Orin NX 16GB module, or should I upgrade to the AGX 32GB?

@tarek.gas assuming you mean llava-13B, it's difficult to pinpoint the precise memory usage including the CLIP encoder and webui, but yeah, I think it would run in 16GB.

Certainly I'd think llava-7b wouldn't be an issue, and I think the Orin NX 16GB is a good option for deploying LLMs/etc. in the Nano form factor. I don't think the AGX Orin 32GB devkits are sold anymore; they are all 64GB devkits now (which admittedly is nice to have for experimenting without worrying about memory usage, and it builds code fast).


The models from Deci (Deci AI) do run nicely on my Orin NX out of the box.

Is there any model that can work on my Jetson Nano 4GB?

@thebigboss84 you might want to try the Microsoft Phi models.

They are 1.3B parameters, but are said to be better than normal for models of that size due to the quality of the training dataset.
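
If it helps to see what loading one looks like, here is a hedged transformers sketch for Phi-1.5 (at the time of this thread it required trust_remote_code; ~1.3B parameters is roughly 2.6GB in FP16, so it may still be tight on a 4GB Nano):

```python
# Load microsoft/phi-1_5 with Hugging Face transformers and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```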

Also, there is the ongoing TinyLlama-1.1B project; last I checked, it was not yet producing coherent output, as it's still being trained.
