Today is my first day trying this Orin Nano Super.
I was successful in setting up “ollama-server”.
And I can ssh into the device.
(The above is not related to the issue below.)
Then I wanted to try “text-generation-webui”.
The guide says: jetson-containers run $(autotag text-generation-webui)
but this starts a build process that fails (tried twice).
Then I pulled a container with: $ docker pull dustynv/text-generation-webui:r35.4.1
and started the server with: jetson-containers$ ./run.sh dustynv/text-generation-webui:r35.4.1
So far so good. I can access the website!
I downloaded a few models, but none of them work.
Even the one from the tutorial video fails:
Model: TheBloke_Llama-2-7B-GPTQ
Model_Loader: ExLlamav2_HF
Traceback (most recent call last):
File "/opt/text-generation-webui/modules/ui_model_menu.py", line 213, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
File "/opt/text-generation-webui/modules/models.py", line 87, in load_model
output = load_func_map[loader](model_name)
File "/opt/text-generation-webui/modules/models.py", line 389, in ExLlamav2_HF_loader
return Exllamav2HF.from_pretrained(model_name)
File "/opt/text-generation-webui/modules/exllamav2_hf.py", line 170, in from_pretrained
return Exllamav2HF(config)
File "/opt/text-generation-webui/modules/exllamav2_hf.py", line 44, in __init__
self.ex_model.load(split)
File "/usr/local/lib/python3.8/dist-packages/exllamav2/model.py", line 248, in load
for item in f: return item
File "/usr/local/lib/python3.8/dist-packages/exllamav2/model.py", line 266, in load_gen
module.load()
File "/usr/local/lib/python3.8/dist-packages/exllamav2/attn.py", line 188, in load
self.input_layernorm.load()
File "/usr/local/lib/python3.8/dist-packages/exllamav2/rmsnorm.py", line 24, in load
w = self.load_weight()
File "/usr/local/lib/python3.8/dist-packages/exllamav2/module.py", line 116, in load_weight
tensors = self.load_multi(["weight"], override_key = override_key)
File "/usr/local/lib/python3.8/dist-packages/exllamav2/module.py", line 77, in load_multi
tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
File "/usr/local/lib/python3.8/dist-packages/exllamav2/fasttensors.py", line 118, in get_tensor
return f.get_tensor(key)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 255, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I have some Linux experience, but no knowledge of containers or Python.
Hopefully you can help, thanks.
Hi,
As mentioned in the link below:
Please run the memory optimization first and test the 7B models with 4-bit quantization.
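For reference, the optimization steps from the tutorial are roughly the following (the swap file path and size are examples, adjust them to your setup):
$ sudo init 3                                   # turn off the desktop GUI for this boot
$ sudo systemctl set-default multi-user.target  # keep it off across reboots
$ sudo systemctl disable nvzramconfig           # disable zram
$ sudo fallocate -l 8G /ssd/8GB.swap            # create a disk-backed swap file instead
$ sudo mkswap /ssd/8GB.swap
$ sudo swapon /ssd/8GB.swap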
Thanks.
I did the optimizations:
total used free shared buff/cache available
Mem: 7.4Gi 503Mi 6.4Gi 26Mi 581Mi 6.7Gi
Swap: 8.0Gi 0B 8.0Gi
Then I tried with these settings:
TheBloke/Llama-2-7b-Chat-GGUF
model: llama-2-7b-chat.Q4_K_M.gguf
model-loader: llama.cpp
n-gpu-layers = 128
And got this error:
20:31:52-888982 INFO Loading llama-2-7b-chat.Q4_K_M.gguf
20:31:52-983571 INFO llama.cpp weights detected: /data/models/text-generation-webui/llama-2-7b-chat.Q4_K_M.gguf
20:31:52-986285 ERROR Failed to load the model.
Traceback (most recent call last):
File "/opt/text-generation-webui/modules/ui_model_menu.py", line 213, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
File "/opt/text-generation-webui/modules/models.py", line 87, in load_model
output = load_func_map[loader](model_name)
File "/opt/text-generation-webui/modules/models.py", line 250, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
File "/opt/text-generation-webui/modules/llamacpp_model.py", line 63, in from_pretrained
Llama = llama_cpp_lib().Llama
AttributeError: 'NoneType' object has no attribute 'Llama'
Hi,
Sorry for missing that.
It looks like you are using Super mode, so your environment should be r36.4.2 or r36.4.3.
As there are dependencies between the GPU driver and the CUDA-related libraries, please use a container built for r36.4.x instead.
The dustynv/text-generation-webui:r35.4.1 image might have unexpected issues when running in an r36 environment.
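To double-check, you can confirm the L4T version on the host and whether PyTorch can see the GPU inside the container (these are generic checks, not specific to this image):
$ cat /etc/nv_tegra_release
$ python3 -c "import torch; print(torch.cuda.is_available())"
In a matching container, the second command should print True.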
You can build one with jetson-containers directly:
$ jetson-containers run $(autotag text-generation-webui)
Namespace(packages=['text-generation-webui'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.4.3 JETPACK_VERSION=6.2 CUDA_VERSION=12.6
-- Finding compatible container image for ['text-generation-webui']
Couldn't find a compatible container for text-generation-webui, would you like to build it? [y/N] y
-- Building containers ['build-essential', 'pip_cache:cu126', 'cuda:12.6', 'cudnn', 'python', 'numpy', 'cmake', 'onnx', 'pytorch:2.5', 'torchvision', 'huggingface_hub', 'rust', 'transformers', 'auto_gptq', 'flash-attention', 'exllama', 'llama_cpp', 'triton', 'auto_awq', 'text-generation-webui']
-- Building container text-generation-webui:r36.4.3-build-essential
...
Thanks.
That fixed it! Thanks!
There are a lot of new abbreviations and concepts. What is the best way to go forward from here?
I tried loading some other models from Hugging Face, but they all failed (even the smallest 1B and 2B models).
Hi,
It’s recommended to follow our tutorial first:
The Llama 7B model should work on Orin Nano 8GB.
But please remember to run the memory optimization first.
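For reference, models can also be downloaded from inside the container with the script bundled in text-generation-webui (the model name below is just an example):
$ cd /opt/text-generation-webui
$ python3 download-model.py TheBloke/Llama-2-7B-GPTQ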
Thanks.