Llama 2 LLMs w/ NVIDIA Jetson and text-generation-webui

llava-llama-2-13b-chat-lightning-gptq through oobabooga: RAM usage went from 14.17GB → 20.39GB (a 6.22GB increase), but that seems low, so take it with a grain of salt. That was after querying it on an image a few times. I think it should run in 16GB though.
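(For anyone wanting to reproduce the measurement, a minimal way to snapshot system RAM before and after loading the model is to read /proc/meminfo; the sketch below does just that and isn't tied to oobabooga itself.)

```python
# Quick way to snapshot RAM usage before/after loading a model on Jetson.
# Run once before starting text-generation-webui and again once the model
# is loaded; the difference approximates the model's resident footprint.
def used_gb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    used_kb = info["MemTotal"] - info["MemAvailable"]
    return used_kb / 1024 / 1024

print(f"RAM in use: {used_gb():.2f} GB")
```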

Hi @dusty_nv, the LlamaSpeak project is cool and fast. Am I correct in saying that it can also run on standalone x86 PCs with an NVIDIA graphics card? This project has a bright future, especially since it can be useful for people with visual impairments.

Here are the two optimizations I performed on my NVIDIA Jetson AGX Orin 64GB Developer Kit:

  • I disabled the graphical interface to free up RAM.
  • I turned on MAXN mode (a sketch of both steps follows this list).
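Here is a sketch of both steps, wrapped in Python for convenience. It assumes MAXN is power mode 0 on the AGX Orin and that the desktop is started by the systemd graphical target; confirm the mode index with `sudo nvpmodel -q` on your unit.

```python
# Sketch of the two tweaks above, wrapped in Python for convenience.
# Assumes MAXN is power mode 0 on the AGX Orin (check with `sudo nvpmodel -q`)
# and that the desktop session is started by the systemd graphical target.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Boot to console only, freeing the RAM the desktop would otherwise hold
# (takes effect after a reboot).
run(["sudo", "systemctl", "set-default", "multi-user.target"])

# Switch the power model to MAXN.
run(["sudo", "nvpmodel", "-m", "0"])
```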

See the video below

The RAM usage is shown below.

Thank you!

Hi @shahizat, glad you got it working! Yes, if you run the Riva container and text-generation-webui on x86, it will run there as well. Someone at NVIDIA has done that too.

I have been circling back to do another round of LLM performance optimizations (including time-to-first-token latency during the prefill stage), and will then continue work on the project.


Wanted to post some recent updates to the thread!

  • AWQ/TinyChat added some optimizations, and is now ~40% faster than llama.cpp

  • I got MLC/TVM working, and it's ~60% faster than llama.cpp, with low prefill latency.

  • We released an interactive Llava-Llama-2-chat-GPTQ tutorial (uses oobabooga)

  • I have an even faster llava-llama-2-chat integration with MLC and a realtime CLIP encoder that I’m packaging up

  • Realtime NanoSAM! https://github.com/NVIDIA-AI-IOT/nanosam

  • I’ve been working on a local_llm mini-package which supports AWQ/MLC (not in oobabooga) and works at the embedding level with unstructured chat for flexible multimodality

  • I’ve started work on a lightweight multimodal vectorDB that can handle high-dimensional embeddings like images/audio/etc. in addition to chat, using FAISS/RAFT and CUDA underneath for similarity search, with efficient memory management and zero-copy (a minimal FAISS sketch follows this list).
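To illustrate the kind of similarity search that sits underneath such a vector DB, here is a minimal CPU-only FAISS sketch. The random vectors stand in for CLIP-style image embeddings, and the dimension and names are placeholders rather than local_llm's actual API.

```python
# Minimal FAISS similarity-search sketch: random vectors stand in for
# high-dimensional image/audio embeddings (e.g. CLIP outputs).
import numpy as np
import faiss

dim = 768                          # e.g. a CLIP ViT-L/14 embedding size
db = np.random.rand(10000, dim).astype("float32")
faiss.normalize_L2(db)             # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)     # exact inner-product search
index.add(db)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
print("top-5 ids:", ids[0], "scores:", scores[0])
```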


Hello, @dusty_nv. Thank you for your updates. I have just tested your 13B llava-llama-2 model example, and it is working very well. The results are impressive and provide a comprehensive description of the image.

The RAM was almost fully occupied while running the 13B-parameter llava-llama-2 model.

Dustin, are you aware of the reason why NVMe SSDs on Jetson Orin devices cannot achieve their full speed? The declared speed is up to 7GB/s, and I have verified that all firmware is up to date, the latest and greatest. I know that attempting to offload RAM at a rate of 204.8GB/s to an NVMe SSD with a speed of 3-7GB/s may not be reasonable at all. I observed that Jim from Jetson Hacks used a different NVMe SSD and achieved the same result as I did. Is it somehow related to a Jetson architecture limitation?

Here are the benchmark results of my NVMe SSD.
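(For comparison, a crude sequential-read check can be done in a few lines of Python; it is not a substitute for fio, and the test file path below is just a placeholder. Drop the page cache first as root so it doesn't inflate the number.)

```python
# Rough sequential-read throughput check (a sketch, not a replacement for fio).
# Assumes a multi-GB test file already exists on the NVMe mount, and that the
# page cache was dropped first:  echo 3 > /proc/sys/vm/drop_caches  (as root).
import os, time

TEST_FILE = "/mnt/nvme/testfile"   # placeholder path to a large file
CHUNK = 64 * 1024 * 1024           # 64 MiB reads

size = os.path.getsize(TEST_FILE)
start = time.time()
with open(TEST_FILE, "rb", buffering=0) as f:
    while f.read(CHUNK):
        pass
elapsed = time.time() - start
print(f"read {size / 1e9:.2f} GB in {elapsed:.2f} s -> {size / 1e9 / elapsed:.2f} GB/s")
```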

@shahizat glad you got the quantized llava-2 example working! I would post a new topic about your NVMe performance so our hardware guys can take a look - I just use Samsung 970s, which I don’t believe have the same level of performance. You might want to check lspci to see what PCIe gen it’s linked at.
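For reference, the same link information that `sudo lspci -vv` reports in the LnkSta/LnkCap fields is also exposed through sysfs; a quick sketch that reads it for an NVMe drive (assuming it enumerates as nvme0, adjust the path if yours differs) could look like:

```python
# Check the negotiated PCIe link speed/width of the NVMe drive via sysfs,
# roughly equivalent to reading LnkSta/LnkCap from `sudo lspci -vv`.
from pathlib import Path

dev = Path("/sys/class/nvme/nvme0/device")  # parent PCI device of the controller
print("current link speed:", (dev / "current_link_speed").read_text().strip())
print("current link width:", (dev / "current_link_width").read_text().strip())
print("max link speed:    ", (dev / "max_link_speed").read_text().strip())
print("max link width:    ", (dev / "max_link_width").read_text().strip())
```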