Llama 2 LLMs with NVIDIA Jetson and text-generation-webui

NVIDIA Jetson Orin hardware enables local LLM execution in a small form factor, suitable for running 13B and 70B parameter Llama 2 models.

In this article we will demonstrate how to run variants of the recently released Llama 2 LLM from Meta AI on NVIDIA Jetson hardware. What is amazing is how simple it is to get up and running: reproducing this content is more or less as easy as spinning up a Docker container that provides all of the required software prerequisites for you. You will leave empowered with the knowledge to explore running additional models on your own to power custom LLM-based applications and services. Full instructions to reproduce this project can be found in this article at Hackster.io.


Thanks @pdecarlo for the awesome article! It’s great to see these LLMs running locally on Jetson. These Llama 2 models are a lot of fun to chat with 😊

If anyone needs a pre-built container image for the text-generation-webui (or other LLM packages), you can find them here:


Hi @dusty_nv, I was wondering whether it is possible to run NVIDIA's just-released “NeVA: NeMo Vision and Language Assistant” on the Jetson Orin devices?


Hi @shahizat, from looking at its model card on NGC, I’m not sure the actual model has been released yet outside of the playground. I did find this though, which, judging by the dependencies, looks like it would run (although probably not fast, since it’s just using HF transformers underneath). It’s on my todo-list to dig more into the Llava models and their derivatives and find the optimal inferencing path for those on Jetson.

On the other hand, I did release an early version of this interactive voice chat this week:


Wow @dusty_nv, I’m excited to test llamaspeak on the AGX Orin kit. I think it has the potential to be a very useful tool for visually impaired and blind people. The ability to upload an image and have it described using Llava models would be a great improvement to the project. I recently tested minigpt4.cpp (GitHub - Maknee/minigpt4.cpp: Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)) on the Xavier NX 16GB, but it was super slow, with a high level of ‘hallucination’. I guess support for GPU offloading for GGML and GPTQ has not been added there yet.


Ah that looks cool! I was able to get it running with GPU enabled after applying some patches to it:

It’s already interactive using AGX Orin and the 13B models, but I’m in the process of updating the version of llama.cpp it uses to enable LLAMA_CUDA_FP16 (updating it to a version before GGUF was introduced and made GGML models incompatible)

Really cool that it can read text in the images! 👍
I like what we have going on here and this kind of collaboration! Together we can share local LLMs, etc. with the community 😊


Impressive results, @dusty_nv. I’ve already tested it using the 7B parameter model, and it does indeed use GPU offloading.

See the results below:

./run.sh $(./autotag minigpt4) /bin/bash -c 'cd /opt/minigpt4.cpp/minigpt4 && python3 webui.py \
  $(huggingface-downloader --type=dataset maknee/minigpt4-7b-ggml/minigpt4-7B-f16.bin) \
  $(huggingface-downloader --type=dataset maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q5_k.bin)'

Utilization of RAM


Glad that you were able to get it running also with CUDA acceleration, @shahizat 👍

I’ve noticed this vision LLM is more sensitive to the choice of quantization and parameter size than normal LLMs; I’ve had to stick with the f16/q5_k combination to get sensible results. Let me know if you find other combinations that work for you.

Also, the upstream MiniGPT-4 says it supports Llama 2, but minigpt4.cpp doesn’t appear to have been explicitly updated for that. If you try it, let me know.


@shahizat I added support for llava at https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llava and tested it on the llama-2 variants:

llava-llama-2-7b-chat

USER: what does the road sign say?
ASSISTANT: The road sign says "Hoover Dam."

USER: how far away is the exit?
ASSISTANT: The exit is 1 mile away.

USER: what is the environment like?
ASSISTANT: The environment is desert-like, with a rocky landscape and a dirt road leading to the exit.

llava-llama-2-13b-chat

USER: what does the text in the road sign say?
ASSISTANT: The text in the road sign says "Hoover Dam Exit 2 Mile."

USER: How far away is the exit?
ASSISTANT: The exit is two miles away from the current location.

USER: What kind of environment is it?
ASSISTANT: The environment is a desert setting, with a mountain in the background.

This is unquantized, but still runs at interactive rates on AGX Orin, and I feel produces better results than MiniGPT-4 (although perhaps that is due to it using FP16). Llama-2 is supported in the upstream MiniGPT-4, but it doesn’t appear that minigpt4.cpp has been updated for it yet.

It does appear however that they have Llava AutoGPTQ quantization working at liuhaotian/llava-llama-2-13b-chat-lightning-gptq · Hugging Face


Wow again, @dusty_nv, this is basically what I need. I immediately wanted to test it on the NVIDIA Jetson Xavier NX using the 7B parameter model. Unfortunately, the issue below appeared. It seems that the bitsandbytes package needs to be built from source rather than installed directly via pip. Can you suggest any workarounds that don’t require rewriting the dockerfile? Thank you!

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
False
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.2
CUDA SETUP: Detected CUDA version 114
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda114_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.

@shahizat if you are using jetson-containers, it will use this dockerfile to build bitsandbytes from source:

The llava container is built on top of the transformers container, and the transformers container is built on top of the bitsandbytes container. The reason I have all those dockerfiles is the patches and complex dependencies needed to get it to build/run on ARM with CUDA acceleration (and to keep it building with all the upstream updates being made).

I’m now trying the Llava AutoGPTQ quantization that they’ve been working on. I think I may also be able to accelerate the CLIP feature extractor with TensorRT to reduce the image pre-processing time.


Very nice. By the way, llama.cpp (GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++) compiles nicely outside of Docker with:

ccmake … -GCodeBlocks -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPYTHON_EXECUTABLE=/usr/bin/python3.8 -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DCUDA_ARCH_BIN=8.7 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.4 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.4/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES="87" -DCMAKE_CUDA_COMPILER_ID="NVIDIA" -DCMAKE_CUDA_STANDARD_COMPUTED_DEFAULT=14 -DCMAKE_CUDA_EXTENSIONS_COMPUTED_DEFAULT="ON"

and then cuBLAS and FP16 enabled, plus -march=armv8.2-a in the CXX and C options that become available after pressing t.
What I noticed is that one cannot push all layers to the GPU without risking strange behaviour.
main -m ./models/13B/ggml-model-q4_0.gguf -n 512 --n-gpu-layers 30 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

That offloads 30 of the 43 layers to the GPU and takes around 4GB.
Works nicely.
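
As a rough way to reason about those numbers (30 of 43 layers taking around 4GB), here is a minimal Python sketch that estimates how many layers might fit in a given memory budget. It assumes every layer costs about the same amount of memory, which is only approximately true, and the function name and the example budgets are hypothetical, not part of llama.cpp.

```python
def layers_that_fit(budget_gb, measured_layers=30, measured_gb=4.0, total_layers=43):
    """Estimate how many layers fit in budget_gb.

    Assumption: all layers are the same size, scaled from one measurement
    (30 layers ~ 4 GB for the 13B q4_0 model mentioned in this thread).
    """
    if budget_gb <= 0:
        return 0
    # Scale the measured ratio linearly and never exceed the model's layer count.
    estimate = int(budget_gb * measured_layers / measured_gb)
    return min(total_layers, estimate)


if __name__ == "__main__":
    for budget in (2.0, 4.0, 8.0):
        print(f"{budget} GB -> about {layers_that_fit(budget)} layers")
```

The result is only a starting value for --n-gpu-layers; the actual limit also depends on the context size and whatever else is using memory on the device.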

@herr_dieter_graef which Jetson are you on? If it has <= 8GB of RAM, does it happen with 7B? I run the 7B models on Orin Nano.

Or perhaps that prompt expects a chat model and you are using the base model?

llama.cpp does build out-of-the-box, and technically you don’t need containers for any of this and could just install everything by hand, but for me personally it quickly becomes untenable with all the packages and constant updates. Not all of them install so cleanly. I do have a couple patches for llama.cpp that I apply, they are to do proper tokenization of BOS/EOS tokens in chat models when using the API. I use llama_cpp_python (which also installs cleanly and includes llama.cpp as a subrepo, so you only need the one)
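
For anyone unsure what a chat model expects compared to a base model, here is a minimal sketch of a Llama 2 chat prompt builder in Python. It assumes the chat template published with the Llama 2 chat models ([INST] blocks, an optional <<SYS>> section, and <s>/</s> around each turn); the function name is hypothetical, and this is not the patch mentioned above. Whether the literal <s> and </s> strings end up as real BOS/EOS token ids depends on the tokenizer and API in use.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"


def build_llama2_chat_prompt(system, turns):
    """Build a Llama 2 chat prompt string (assumed template, see lead-in).

    turns is a list of (user_message, assistant_reply) pairs; the last
    assistant_reply should be None so the model generates it.
    """
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system prompt is folded into the first user message.
            user = f"{B_SYS}{system}{E_SYS}{user}"
        prompt += f"{BOS}{B_INST} {user.strip()} {E_INST}"
        if assistant is not None:
            # Completed turns are closed with EOS before the next BOS.
            prompt += f" {assistant.strip()} {EOS}"
    return prompt


if __name__ == "__main__":
    print(build_llama2_chat_prompt("Be brief.", [("Hi", "Hello!"), ("Who are you?", None)]))
```

A base model has no such template, which is one reason a chat-style prompt can behave oddly with it.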

If you suspect something may be off with your build and is causing the strange behavior, try seeing if it also happens in the llama_cpp container.

Thank you for the link. Unfortunately, your llama.cpp is completely different from my llama.cpp. But it seems that your patches found their way into the sources.

Update: I was able to get quantized liuhaotian/llava-llama-2-13b-chat-lightning-gptq model running in text-generation-webui with multimodal support and AutoGPTQ:

It needed a couple patches, which have been added in commit 58b11b1

Then this ran it:

./run.sh $(./autotag text-generation-webui) /bin/bash -c \
  "cd /opt/text-generation-webui && python3 server.py --listen \
	--model-dir=/data/models/text-generation-webui \
	--model=llava-llama-2-13b-chat-lightning-gptq \
	--multimodal-pipeline=llava-llama-2-13b \
	--extensions=multimodal \
	--chat \
	--verbose"

That’s really cool, @dusty_nv; it provided a decent description of the image. How much RAM was utilized? Let me try running it on the Nvidia Jetson Xavier 16GB. Presumably, it is going to fail, if it is a 13B parameter model)))

@shahizat the device is busy for a while, but I recall it being similar to llama2-13B usage with 4-bit quantization. Getting the actual memory number is kind of tricky. liuhaotian doesn’t have a similar GPTQ quant for llava-llama-2-7b (presumably because it’s a LoRA), but there’s a merged version here that you could try to quantize with AutoGPTQ:

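
On the memory question: since the CPU and GPU share the same physical memory on Jetson, one rough approach is to compare system-wide usage before and after loading the model. Below is a minimal Python sketch that reads /proc/meminfo; the helper names are hypothetical, and the result is only an estimate because other processes and caches also change over time.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def used_gib(info):
    """Memory in use, taken as MemTotal minus MemAvailable, in GiB."""
    return (info["MemTotal"] - info["MemAvailable"]) / (1024 ** 2)


if __name__ == "__main__":
    # Run once before loading the model and once after, then subtract.
    with open("/proc/meminfo") as f:
        print(f"in use: {used_gib(parse_meminfo(f.read())):.2f} GiB")
```

Taking the difference between the two readings gives an approximate footprint for the loaded model.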

Hi @dusty_nv, just FYI, a 4-bit GPTQ quantised model of Falcon 180B was released: TheBloke/Falcon-180B-Chat-GPTQ · Hugging Face

Haha wow… I still think that would take 90+ GB of memory, right? There have been some discussions about running it on CPU systems with lots of RAM over at r/LocalLLaMA:

https://www.reddit.com/r/LocalLLaMA/comments/16cm9d0/options_for_running_falcon_180b_on_kind_of_sane/
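
The 90+ GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. Here is a minimal Python sketch of that estimate; the function name is hypothetical, and it counts the weights only, so the KV cache, activations, and quantization metadata come on top.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, params, bits in [
        ("Falcon 180B, 4-bit", 180, 4),
        ("Llama 2 13B, 4-bit", 13, 4),
        ("Llama 2 13B, FP16", 13, 16),
        ("Llama 2 7B, FP16", 7, 16),
    ]:
        print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```

By this estimate the 4-bit Falcon 180B weights alone come to about 90 GB, while a 4-bit 13B model needs about 6.5 GB for its weights.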

Personally, I’m happy with the llama-2-13B level of intelligence and interactivity, and trying to get back to the same place performance-wise with the Vision LLM variants of it. Then add webcam feed to llamaspeak so it can “see you” :)

How did that experiment turn out? I want to try it with a 13B model on a Xavier 16GB.