LLaMa 2 LLMs w/ NVIDIA Jetson and text-generation-webui

llava-llama-2-13b-chat-lightning-gptq through oobabooga: RAM usage went from 14.17GB → 20.39GB (a 6.22GB increase), but that seems low, so take it with a grain of salt. That was after querying it on an image a few times. I think it should run in 16GB, though.
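If you want to reproduce this kind of measurement, you can watch memory usage on the Jetson while querying the model (standard commands, shown as a sketch rather than the exact method used above):

```shell
# Report system-wide memory usage, refreshing every 2 seconds
free -h -s 2

# Jetson-specific: tegrastats also reports RAM, GPU load, and power
sudo tegrastats --interval 1000
```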

Hi @dusty_nv, the llamaspeak project is cool and fast. Am I correct in saying that it can also run on standalone x86 PCs with an NVIDIA graphics card? This project has a bright future, especially since it can be useful for people with visual impairments.

Here are the two optimizations I performed on my NVIDIA Jetson AGX Orin 64GB Developer Kit:

  • I disabled the graphical interface to free up RAM.
  • I turned on the MAXN mode.
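For reference, both tweaks can be made from the command line (standard JetPack commands; a sketch, not necessarily the exact steps from the video):

```shell
# Boot to a text console instead of the desktop to free up RAM
sudo systemctl set-default multi-user.target
sudo reboot

# Enable MAXN power mode (mode 0 on AGX Orin) and max out the clocks
sudo nvpmodel -m 0
sudo jetson_clocks
```

To restore the desktop later, set the default target back with `sudo systemctl set-default graphical.target`.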

See the video below

RAM usage:

Thank you!


Hi @shahizat, glad you got it working! Yes, if you run the Riva container and text-generation-webui on x86, llamaspeak will run on x86 as well. Someone at NVIDIA has done that too.

I have been circling back to do another round of LLM performance optimizations (including time-to-first-token latency during the prefill stage), but will then continue work on the project.


Wanted to post some recent updates to the thread!

  • AWQ/TinyChat added some optimizations, and is now ~40% faster than llama.cpp

  • I got MLC/TVM working; it's ~60% faster than llama.cpp, with low prefill latency.

  • We released an interactive Llava-Llama-2-chat-GPTQ tutorial (uses oobabooga)

  • I have an even faster llava-llama-2-chat integration with MLC and realtime CLIP encoder that I’m packaging up

  • Realtime NanoSAM! https://github.com/NVIDIA-AI-IOT/nanosam

  • I’ve been working on a local_llm mini-package which supports AWQ/MLC (not in oobabooga) and works at the embedding level with unstructured chat for flexible multimodality

  • I’ve started work on a lightweight multimodal vectorDB that can handle high-dimensional embeddings like images/audio/etc. in addition to chat, using FAISS/RAFT and CUDA underneath for similarity search, with efficient memory management and zero-copy transfers.
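At its core, the similarity search in such a vector DB is what FAISS calls a flat inner-product index. A minimal NumPy sketch of that operation (FAISS/RAFT perform the same search with optimized CPU/CUDA kernels; names here are illustrative):

```python
import numpy as np

def search(index_embeddings, query, k=3):
    """Brute-force inner-product search, like FAISS's IndexFlatIP.

    index_embeddings: (N, D) array of stored embeddings (image/audio/text)
    query:            (D,) query embedding
    Returns the top-k row indices and their similarity scores.
    """
    scores = index_embeddings @ query      # (N,) inner products
    top = np.argsort(-scores)[:k]          # indices of the k largest scores
    return top, scores[top]

# Toy example: 4 stored embeddings of dimension 3
db = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.7, 0.7, 0.0],
               [0.0, 0.0, 1.0]])
idx, scores = search(db, np.array([1.0, 0.2, 0.0]), k=2)
print(idx, scores)  # nearest two embeddings and their scores
```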


Hello, @dusty_nv. Thank you for your updates. I have just tested your 13B llava-llama-2 model example, and it is working very well. The results are impressive and provide a comprehensive description of the image.

The RAM was almost fully occupied while running the 13B-parameter llava-llama-2 model.

Dustin, are you aware of the reason why NVMe SSDs on Jetson Orin devices cannot achieve their full speed? The declared speed is up to 7GB/s, and I have verified that all firmware is up to date, the latest and greatest. I know that attempting to offload RAM with 204.8GB/s of bandwidth to an NVMe SSD rated at 3-7GB/s may not be reasonable at all. I observed that Jim from JetsonHacks used a different NVMe SSD and achieved the same result as I did. Is it somehow related to a Jetson architecture limitation?

Here are the benchmark results of my NVMe SSD.

@shahizat glad you got the quantized llava-2 example working! I would post a new topic about your NVMe performance so our hardware guys can take a look. I just use Samsung 970s, which I don’t believe have the same level of performance. You might want to check lspci to see what PCIe gen it’s linked at.
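To check the negotiated link, something along these lines works (the example output values are illustrative):

```shell
# List NVMe controllers with their link capability vs. negotiated link status
sudo lspci -vv | grep -iE "non-volatile|lnkcap:|lnksta:"

# If LnkSta reports e.g. "Speed 8GT/s, Width x4", the link is PCIe Gen3 x4
# (~4GB/s theoretical max), which would cap a Gen4 drive well below 7GB/s
```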

Hi @dusty_nv, I am using your Docker image on an AGX Xavier 16GB. I get 3 tokens/sec max with a 13B model in the webui. My goal is to reach 7 tokens/sec. Is that possible? Thanks!

Hi @dusty_nv, I’ve been experimenting with the current python interface to llama.cpp on a 16GB Xavier AGX and I’m impressed with the results.

I hit one or two minor problems getting it going; I couldn’t build the code using make, and the default cmake was out of date, but installing the latest cmake from source fixed the build.

It runs llama-2-13b-chat.ggmlv3.q6_K.gguf well, giving me about 5 tokens/sec. If only I had an Orin :)

I’ve also run codellama-7b-instruct.Q6_K.gguf with a 7k context length: that’s enough to tackle some serious coding.
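For anyone curious, loading a GGUF model through the llama-cpp-python bindings looks roughly like this (the model path, context size, and layer count are examples, a sketch rather than my exact settings):

```python
from llama_cpp import Llama

# Load the quantized model, offloading all layers to the GPU
# (n_gpu_layers=-1) and enabling the larger 7k context window
llm = Llama(
    model_path="codellama-7b-instruct.Q6_K.gguf",  # example path
    n_ctx=7168,
    n_gpu_layers=-1,
)

# Completion-style call; returns an OpenAI-like response dict
out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
```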


Hi @robert.semmler1000, I’m not sure, but which loader and 13B model (quantized/etc.) are you using?

Hi @romilly.cocking, here are the steps I use to build llama_cpp_python (from Dockerfile): https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/llama_cpp/Dockerfile
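Outside of the container, building llama-cpp-python with CUDA enabled typically comes down to something like the following (flags from that era of llama.cpp; a sketch, check the Dockerfile above for the exact steps):

```shell
# Make sure a recent CMake is available first (distro versions may be too old)
pip3 install --upgrade cmake

# Build llama-cpp-python against the cuBLAS GPU backend
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
```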

That’s good to hear! I’ve not yet tried codellama, although I did get llama-2 to write code for calling plugins that I defined for it in the system prompt (interspersed with chat). If you find a good way of using it, let us know!

TheBloke/Llama-2-13B-chat-GPTQ · Hugging Face (just paste it into the download text field) with the ExLlamaHF loader. It reports a urllib and Python version problem for ExLlamaHF, but it works.

Hi @shahizat and @dusty_nv , can you please share a link to the llamaspeak project which is shown in the video?
Can’t find a repository or something and it looks really awesome!

Hi @jschw, here is the link: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llamaspeak


@robert.semmler1000 just FYI, I get ~40% better performance from llama.cpp on GGML/GGUF models than from ExLlama on GPTQ models

Thank you! I’m using this Docker image: dustynv/llama_cpp:ggml-r35.4.1. Do you use it in the terminal only, or with a UI? Never mind, I found it: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llama_cpp

@robert.semmler1000 yes, that one is just the core llama.cpp / llama_cpp_python packages, typically used over CLI (or with another app you write using those libraries). Or you can use the text-generation-webui container, which includes the llama_cpp container, and use the llama_cpp loader in oobabooga.

I keep llama_cpp:ggml for people still using GGML models, but they are easy to convert to GGUF (and TheBloke has lots of GGUF models on the Hugging Face Hub already). llama_cpp:gguf tracks the upstream repos and is what the text-generation-webui container uses to build.
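For reference, llama.cpp ships a conversion script for legacy GGML files; the invocation is roughly as below (the script name and flags have varied across llama.cpp versions, so treat this as a sketch):

```shell
# Convert a GGMLv3 model file to GGUF using the script bundled with llama.cpp
python3 convert-llama-ggml-to-gguf.py \
    --input  llama-2-13b-chat.ggmlv3.q6_K.bin \
    --output llama-2-13b-chat.q6_K.gguf
```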

Hi @dusty_nv
I would like to thank you for the cool updates. I’m particularly interested in the deployment of Llava on a Jetson board. I’m planning to order a Jetson Orin NX 16GB to carry out the projects, but I’ve seen that you recommend an AGX Orin 32GB as the minimum requirement in your tutorial. Is there any chance for a quantized Llava to run on the Orin NX 16GB module, or should I upgrade to the AGX 32GB?

@tarek.gas assuming you mean llava-13B, it’s difficult to pinpoint the precise memory usage including the CLIP encoder and webui, but yea, I think it would run in 16GB:

Certainly I’d think llava-7b wouldn’t be an issue, and I think the Orin NX 16GB is a good option for deploying LLMs/etc. in the Nano form-factor. I don’t think the AGX Orin 32GB devkits are sold anymore; they are all the 64GB devkits now (which admittedly is nice to have for experimenting and not worrying about memory usage, and it builds code fast)


The models from Deci AI run nicely on my Orin NX out of the box.

Is there any model that can work on my Jetson Nano 4GB?

@thebigboss84 you might want to try the Microsoft Phi models:

They are only 1.3B parameters, but are said to perform better than typical models of that size due to the quality of their training dataset.
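Loading Phi through Hugging Face transformers is roughly as follows (the model ID and trust_remote_code flag follow the model card at the time; a sketch, not tested on a Nano, and a 1.3B model will still be tight in 4GB without quantization):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# phi-1_5 is ~1.3B parameters; at the time it required trust_remote_code=True
model_id = "microsoft/phi-1_5"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tok("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
```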

Also there is the TinyLlama-1.1B project that is ongoing, last I checked it was not yet producing coherent output as it’s still being trained:
