NVIDIA Jetson Orin hardware enables local LLM execution in a small form factor, making it well suited to running the 13B and 70B parameter Llama 2 models.
In this article we will demonstrate how to run variants of the recently released Llama 2 LLM from Meta AI on NVIDIA Jetson hardware. What is amazing is how simple it is to get up and running. Reproducing this content is more or less as easy as spinning up a Docker container that provides all of the required software prerequisites for you. You will leave empowered with the knowledge to explore running additional models on your own to power custom LLM-based applications and services. Full instructions to reproduce this project can be found in the accompanying article on Hackster.io.
Thanks @pdecarlo for the awesome article! It's great to see these LLMs running locally on Jetson. These Llama 2 models are a lot of fun to chat with!
If anyone needs a pre-built container image for the text-generation-webui (or other LLM packages), you can find them here:
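For anyone new to jetson-containers, launching one of these pre-built images looks roughly like the following (a sketch; check the repo README for the current package names and helper scripts):
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
# autotag picks a pre-built image matching your JetPack version; run.sh starts it with GPU access
./run.sh $(./autotag text-generation-webui)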
Hi @dusty_nv, I was wondering if it is possible to run the just-released NVIDIA "NeVA: NeMo Vision and Language Assistant" on the Jetson Orin devices?
Hi @shahizat, from looking at its model card on NGC, I'm not sure the actual model has been released yet outside of the playground. I did find this though, which judging by the dependencies looks like it would run (although probably not fast, since it's just using HF transformers underneath). It's on my to-do list to dig more into the Llava models and their derivatives and find the optimal inferencing path for them on Jetson.
On the other hand, I did release an early version of this interactive voice chat this week:
Wow @dusty_nv, I'm excited to test llamaspeak on the AGX Orin kit. I think it has the potential to be a very useful tool for visually impaired and blind people. The ability to upload an image using Llava models and have it described would be a great improvement to the project. I've just recently tested minigpt4.cpp (GitHub - Maknee/minigpt4.cpp: Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)) on the Xavier NX 16GB, but it was super slow with a high level of "hallucination". I guess support for GPU offloading of GGML and GPTQ models has not been added there yet.
Ah that looks cool! I was able to get it running with GPU enabled after applying some patches to it:
It's already interactive using AGX Orin and the 13B models, but I'm in the process of updating the version of llama.cpp it uses to enable LLAMA_CUDA_FP16 (updating it to the last version before GGUF was introduced and made GGML models incompatible).
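For anyone building llama.cpp by hand in the meantime, that fp16 path is just a CMake option - a rough sketch (the flag has been spelled LLAMA_CUDA_F16 in some releases, so verify against your checkout's CMakeLists.txt):
mkdir build && cd build
# LLAMA_CUBLAS enables the CUDA backend; the F16 option switches some CUDA kernels to half precision
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build . --config Release -j$(nproc)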
Really cool that it can read text in the images!
I like what we have going on here and this kind of collaboration! Together we can share local LLMs/etc. with the community!
Glad that you were able to get it running with CUDA acceleration as well, @shahizat!
I've noticed this vision LLM is more sensitive to the choice of quantization and parameter size than a normal LLM; I've had to stick with the f16/q5_k combination to get sensible results. Let me know if you find other combinations that work for you.
Also, the upstream MiniGPT-4 says it supports Llama 2, but minigpt4.cpp doesn't appear to have been explicitly updated for that - if you try it, let me know.
USER: what does the road sign say?
ASSISTANT: The road sign says "Hoover Dam."
USER: how far away is the exit?
ASSISTANT: The exit is 1 mile away.
USER: what is the environment like?
ASSISTANT: The environment is desert-like, with a rocky landscape and a dirt road leading to the exit.
llava-llama-2-13b-chat
USER: what does the text in the road sign say?
ASSISTANT: The text in the road sign says "Hoover Dam Exit 2 Mile."
USER: How far away is the exit?
ASSISTANT: The exit is two miles away from the current location.
USER: What kind of environment is it?
ASSISTANT: The environment is a desert setting, with a mountain in the background.
This is unquantized, but it still runs at interactive rates on AGX Orin, and I feel it produces better results than MiniGPT-4 (although perhaps that is due to it using FP16). Llama-2 is supported in the upstream MiniGPT-4, but it doesn't appear that minigpt4.cpp has been updated for it yet.
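If anyone wants to reproduce this, the llava package in jetson-containers is probably the quickest route - roughly (a sketch; see the package README for the exact chat script and model arguments):
# launch the pre-built llava container with GPU access
./run.sh $(./autotag llava)
# inside the container, point the upstream llava chat script at the llava-llama-2-13b-chat weights from HuggingFace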
Wow again, @dusty_nv, this is basically what I need. I immediately wanted to test it on the NVIDIA Jetson Xavier NX using the 7B parameter model. Unfortunately, the issue below appeared. It seems that the bitsandbytes package needs to be built from source rather than installed directly via pip. Can you suggest any workarounds that don't involve rewriting the dockerfile? Thank you!
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
False
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.2
CUDA SETUP: Detected CUDA version 114
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda114_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
@shahizat if you are using jetson-containers, it will use this dockerfile to build bitsandbytes from source:
The llava container is built on top of the transformers container, and the transformers container is built on top of the bitsandbytes container. The reason I have all those dockerfiles is the patches and complex dependencies needed to get everything to build/run on ARM with CUDA acceleration (and to keep it building as upstream updates are made).
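Building the whole chain locally is a single command against the top-level package, since the dependencies get pulled in automatically - roughly (a sketch; package names follow the jetson-containers repo layout at the time of writing):
# builds bitsandbytes -> transformers -> llava in dependency order
./build.sh llava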
I'm now trying the Llava AutoGPTQ quantization that they've been working on. I think I may also be able to accelerate the CLIP feature extractor with TensorRT to reduce the image pre-processing time.
Very nice. By the way, GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++ compiles nicely outside of Docker with: ccmake … -GCodeBlocks -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPYTHON_EXECUTABLE=/usr/bin/python3.8 -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DCUDA_ARCH_BIN=8.7 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.4 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.4/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES="87" -DCMAKE_CUDA_COMPILER_ID="NVIDIA" -DCMAKE_CUDA_STANDARD_COMPUTED_DEFAULT=14 -DCMAKE_CUDA_EXTENSIONS_COMPUTED_DEFAULT="ON"
Then enable cuBLAS and FP16, and add -march=armv8.2-a to the C and CXX options (available after pressing 't' in ccmake).
What I noticed is that you cannot push all layers to the GPU without risking strange behaviour.
main -m ./models/13B/ggml-model-q4_0.gguf -n 512 --n-gpu-layers 30 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
That offloads 30 of the 43 layers to the GPU, takes around 4 GB, and works nicely.
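To double-check that memory figure while the model is loaded, watching tegrastats in a second terminal works well (the iGPU shares system RAM, so the overall RAM line is the number to watch):
# report RAM/GPU utilization once per second; Ctrl+C to stop
sudo tegrastats --interval 1000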
@herr_dieter_graef which Jetson are you on? If it has <= 8GB of RAM, does it happen with 7B? I run the 7B models on Orin Nano.
Or perhaps that prompt expects a chat model and you are using the base model?
llama.cpp does build out-of-the-box, and technically you don't need containers for any of this and could just install everything by hand, but for me personally it quickly becomes untenable with all the packages and constant updates - not all of them install so cleanly. I do have a couple of patches for llama.cpp that I apply; they handle proper tokenization of the BOS/EOS tokens in chat models when using the API. I use llama_cpp_python (which also installs cleanly and includes llama.cpp as a subrepo, so you only need the one).
If you suspect that something with your build may be off and is causing the strange behavior, try seeing if it also happens in the llama_cpp container.
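Launching that container for a quick A/B test would look roughly like this (assuming the llama_cpp package name in jetson-containers; mount your models directory so both builds load the same GGML file):
# the -v mount path is just an example; run.sh passes extra options through to docker run
./run.sh -v /path/to/models:/models $(./autotag llama_cpp)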
Thank you for the link. Unfortunately, your llama.cpp is quite different from my llama.cpp, but it seems that your patches have found their way into the upstream sources.
That's really cool, @dusty_nv; it provided a decent description of the image. How much RAM was utilized? Let me try running it on the NVIDIA Jetson Xavier NX 16GB. Presumably it is going to fail if it is a 13B parameter model)))
@shahizat the device is busy for a while, but I recall it being similar to llama2-13B usage with 4-bit quantization. Getting the actual memory number is kind of tricky. liuhaotian doesn't have a similar GPTQ quant for llava-llama-2-7b (presumably because it's a LoRA), but there's a merged version here that you could try to quantize with AutoGPTQ:
Haha, wow… I still think that would take 90+ GB of memory, right? There have been some discussions about running it on CPU systems with lots of RAM over at r/LocalLlama:
Personally, I'm happy with the llama-2-13B level of intelligence and interactivity, and trying to get back to the same place performance-wise with the Vision LLM variants of it. Then add webcam feed to llamaspeak so it can "see you" :)