Flash_Attention_2 && BitsandBytes && more

Ideally I’d like to get these working, but no luck so far, even after opening a shell in the container (docker exec -it /bin/bash) and running:
pip install bitsandbytes; pip install flash_attn

I set up LD_LIBRARY_PATH so it knows where to look for the Python modules, both in the container and on my Jetson Orin. I even went as far as installing flash_attn and bitsandbytes on my local Orin 64GB.
It still claims it can’t find them.
Also, does the model loader have an impact on how fast the model performs and how do I match the loader with the model? I know there’s gguf and gptq. Which one is the fastest and which loader should be used? I’ve been experimenting but haven’t found the happy medium I’m looking for.
As mentioned, I’ve been looking for a happy medium between 13B and 70B. 70B is too slow but when I try other smaller parameter count models e.g. 34B it’s either on par with the 70B or slower.
Another thing that interests me is mixture of experts. I found one that’s 4x7B and it runs about as quickly as the Llama 2 70B. Which leads me to ask: why is it so slow when it’s actually only inferencing two layers at a time AND the parameter count is lower? I’d really like to better understand some of the options and how I can match them with a given model. I understand I can’t lower the bit depth and quantize, but I’d like to understand how inference consumes resources, learn more about the language in the model cards, and cross-reference that with the options in text-generation-webui. I know performance is quadratic and that’s why I’m interested in Flash_Attention_2, but maybe I’m going down the wrong path.
With respect to RAG using text-generation-webui: I enable superboogav2 and apply, and it loads, then reloads the page and doesn’t crash. I validate that it’s still selected, and it is. But what does this actually do, and where do I access its additional RAG features? It’d be cool if it could learn from PDF files.
Lastly, I understand this isn’t RAG, but if you save chat sessions and then import them, the model will reference the prior session. Any suggestions will be greatly appreciated, and sorry for the long message. I have many ideas and questions but need help establishing a systematic way to experiment. Understanding the tools better will let me make progress more effectively on this generative AI journey.

Jason T.

@jtmuzix I encourage you to look at my builds in dusty-nv/jetson-containers on GitHub (Machine Learning Containers for NVIDIA Jetson and JetPack-L4T) and the guides from Jetson AI Lab. Much like you alluded to with regard to maintaining a systematic approach, I use these containers to keep my development environment sane and reproducible when working with these complex AI packages and their dependencies.

I would not recommend using bitsandbytes; it’s slow for inferencing and has been superseded by other quantization techniques like GGUF and GPTQ. bitsandbytes also required patching for Jetson/ARM64, and for those reasons I haven’t taken the time to update the patches for JetPack 6. However, I have CUDA-accelerated containers already built for llama.cpp and AutoGPTQ on Jetson/ARM64 and JetPack 6. These packages are in turn included in my text-generation-webui container (tutorial here).

And I have built flash_attn inside of other containers that use it, but I don’t have a standalone container for that library. For example, it is included in MLC/TVM, which is the fastest LLM inferencing library currently available on Jetson (and is what’s used for the benchmarks on jetson-ai-lab.com and my demos).
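
As a quick sanity check on the "can’t find it" errors: Python locates modules through its own search path (sys.path), not through LD_LIBRARY_PATH, which only affects shared-library lookups. The sketch below is a hypothetical helper (check_modules is not part of jetson-containers) that asks a given interpreter whether it can see each module. It only tests discoverability, not whether the CUDA parts load, and it assumes you run it inside the same container that runs server.py.

```shell
# Hypothetical helper: report whether each named Python module is
# discoverable by the given interpreter (does not import or run it).
check_modules() {
    py="$1"; shift
    for mod in "$@"; do
        if "$py" -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$mod') else 1)" 2>/dev/null; then
            echo "$mod: found"
        else
            echo "$mod: missing"
        fi
    done
}

# Check the two packages discussed in this thread.
check_modules python3 bitsandbytes flash_attn
```

If a package shows as missing right after pip install, it may be that pip and python3 point at different interpreters; running python3 -m pip install instead of bare pip is one way to rule that out.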

That is great information, thank you very much. I have one more question: how do I enable --trust-remote-code when using the NVIDIA container system with run.sh and autotag? text-generation-webui is running in a Docker container and I’m not sure where to put that flag to allow remote code execution. Enabling this feature would allow me to use many more language models. I’m thinking it may have to be added to a Docker Compose file, but I’m unsure. Also, I’ve tried building the program using start_linux.sh, but it errors out on a torch version. Any advice will be greatly appreciated.


Hi @jtmuzix, try starting the text-generation-webui container like this instead:

./run.sh $(./autotag text-generation-webui) /bin/bash -c '
  cd /opt/text-generation-webui && python3 server.py \
  --model-dir=/data/models/text-generation-webui \
  --listen --verbose \
  --trust-remote-code'

Specifying the full command-line when you start the container allows you to run your own server.py options instead of the default ones.
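
To make it easier to swap flags in and out, here is a minimal sketch that builds the in-container command string from a list of server.py options. build_server_cmd is a made-up helper name, not something shipped with the container; the paths are the ones used in the command above.

```shell
# Hypothetical helper: print the command to run inside the container,
# appending whatever server.py flags are passed as arguments.
build_server_cmd() {
    echo "cd /opt/text-generation-webui && python3 server.py --model-dir=/data/models/text-generation-webui $*"
}

# Preview the command before launching anything.
build_server_cmd --listen --verbose

# Then, from the jetson-containers checkout, something like:
# ./run.sh $(./autotag text-generation-webui) /bin/bash -c "$(build_server_cmd --listen --verbose)"
```

Printing the command first is just a dry run, so you can confirm the flags before the container starts.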

Why am I not able to get 3rd party extensions to load? I’ve put --extensions // as an argument for server.py. It loads any/all of the already included extensions, but I can’t get any 3rd party extension to load. I’ve given the 3rd party extensions permissions of 777. Any ideas would be greatly appreciated because there are a few extensions I’m interested in trying (obviously lol). It could be a dependency problem, but I’ve run requirements.txt through pip and had no issues. I also updated all of my pip modules. The one thing I still have an issue with is ChromaDB (vector database), so I’m trying to work through that. Please let me know what logs/info you might need. Thank you in advance for any assistance!


@jtmuzix admittedly I’m not very familiar with 3rd-party oobabooga extensions (I have used the built-in ones like multimodal). Does it automatically download them for you, or do you need to install them first? If that’s the case, I would just start the text-generation-webui container with a bash terminal, and then you can navigate around and clone/pull the code you need where you need it:

./run.sh $(./autotag text-generation-webui) /bin/bash

That will give you a command prompt inside the container from which you can run normal Linux commands. Then when you’re ready, you can invoke server.py from the same shell with your desired arguments.
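
Once you are in that shell, one thing worth checking is the folder layout. As far as I know, text-generation-webui loads an extension by importing extensions/<name>/script.py, so the folder name has to match what you pass to --extensions. The sketch below is a hypothetical check (check_extension and my_extension are illustrative names), assuming the web UI lives in /opt/text-generation-webui as in the container.

```shell
# Hypothetical helper: confirm <webui_root>/extensions/<name>/script.py exists,
# which is the file the web UI imports when --extensions <name> is given.
check_extension() {
    root="$1"; name="$2"
    if [ -f "$root/extensions/$name/script.py" ]; then
        echo "ok: pass --extensions $name to server.py"
    else
        echo "missing: $root/extensions/$name/script.py"
        return 1
    fi
}

# Example call inside the container (my_extension is a placeholder name):
# check_extension /opt/text-generation-webui my_extension
```

If the file is there and the extension still fails to load, the startup log from server.py (with --verbose) is the next place to look, since an import error in the extension’s own dependencies would show up there.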

You may find this interesting: Adding extensions to text-generation-webui (as packaged by @dusty_nv in jetson-containers)
