MiniGPT-4 on Jetson Orin Nano 8GB Dev Kit not working

Hello, not sure if this is the right group. I am trying to run MiniGPT-4 from the tutorial https://www.jetson-ai-lab.com/tutorial_minigpt4.html.

I have installed Docker and run the command listed in the tutorial. It looks like it downloaded all the files and then nothing happened; it just returns to the command line, as seen below:

ggml_init_cublas: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7
llama.cpp: loading model from /data/models/huggingface/datasets--maknee--ggml-vicuna-v0-quantized/snapshots/1d8789f34eb803bf52daf895c7ecfd2559cf5ccc/ggml-vicuna-13B-v0-q5_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 10959 MB
kalustian@ubuntu:~$

NOTE: I was able to successfully run the "Tutorial - text-generation-webui" tutorial, but not the one above.

Any idea on what the issue might be?

Thanks

Hi @kalustian, it would have run out of memory on the Orin Nano 8GB; apparently there were updates to the model or llama.cpp that caused this. I will update the docs to reflect that, sorry about that. On Orin Nano, I would use the VILA-2.7b or VILA1.5-3b multimodal models:

I am still trying to use and learn the Jetson, sorry for the question… what terminal commands should I use to download and run VILA-2.7b or VILA1.5-3b?

@kalustian, see the NanoVLM page I linked to above for the commands.

under the " Multimodal Chat" option , I copy/pasted the commands:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

After a few minutes I got the result below:

dustynv/nano_llm:24.5.1-r36.2.0

  • sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
    warnings.warn(
    Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48990.07it/s]
    Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 45473.96it/s]
    18:21:39 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
    18:21:40 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors

Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights… This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 2%|█▏ | 3/197 [00:02<02:25, 1.33tensors/s]Traceback (most recent call last): | 1/327 [00:02<15:30, 2.85s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
model = NanoLLM.from_pretrained(
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$

I was expecting the interactive console-based chat with Llava to start… well, it looks like it didn't.

I also tried the " Automated Prompts" option…and here is what I see in the terminal:

Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.3.0 JETPACK_VERSION=6.0 CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
[sudo] password for kalustian:
dustynv/nano_llm:24.5.1-r36.2.0

  • sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32 --prompt /data/images/hoover.jpg --prompt 'what does the road sign say?' --prompt 'what kind of environment is it?' --prompt reset --prompt /data/images/lake.jpg --prompt 'please describe the scene.' --prompt 'are there any hazards to be aware of?'
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    Fetching 13 files: 0%| | 0/13 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
    warnings.warn(
    Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48167.80it/s]
    Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 36944.65it/s]
    18:30:01 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
    18:30:03 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors

Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights… This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 1%|█ | 2/197 [00:02<03:52, 1.19s/tensors]Traceback (most recent call last): | 1/327 [00:02<15:28, 2.85s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
model = NanoLLM.from_pretrained(
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$

@kalustian, your board keeps running out of memory on processes that other Orin Nano users have been able to run (perhaps this is related to why you were unable to run the MiniGPT-4 example too). Can you try mounting swap, disabling ZRAM, and if necessary disabling the desktop UI, as shown here:
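For anyone following along, here is a rough sketch of those memory optimizations. The 16GB size and the /ssd swap location are assumptions to adjust for your own storage; on JetPack, ZRAM is provided by the nvzramconfig service.

# Confirm the SIGKILL came from the kernel OOM killer
sudo dmesg | grep -i -e "killed process" -e "out of memory"

# Disable ZRAM so compressed swap stops competing for RAM
sudo systemctl disable nvzramconfig

# Create and enable a swap file (size and path are assumptions)
sudo fallocate -l 16G /ssd/16GB.swap
sudo chmod 600 /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap

# Make the swap persistent across reboots
echo "/ssd/16GB.swap  none  swap  sw 0  0" | sudo tee -a /etc/fstab

# Optionally boot to console instead of the desktop UI to free more RAM
# (revert later with: sudo systemctl set-default graphical.target)
sudo systemctl set-default multi-user.target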

Eureka moment?

Well, I started again from ground zero, re-imaged the NVMe drive, and applied your suggested RAM optimization settings. Now, when I copy/paste and run the commands under the "Multimodal Chat" option:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

I got the results below in the terminal:

PROMPT: Who was the First successful ascent to the summit of Mount Everest? At what time they reached the summit?

The first successful ascent to the summit of Mount Everest was made by a team of three climbers:

  1. Sir Edmund Hill

┌───────────────┬─────────────┐
│ embed_time │ 0.000218919 │
├───────────────┼─────────────┤
│ input_tokens │ 29 │
├───────────────┼─────────────┤
│ output_tokens │ 32 │
├───────────────┼─────────────┤
│ prefill_time │ 0.0175087 │
├───────────────┼─────────────┤
│ prefill_rate │ 1656.32 │
├───────────────┼─────────────┤
│ decode_time │ 1.07084 │
├───────────────┼─────────────┤
│ decode_rate │ 29.883 │
└───────────────┴─────────────┘

QUESTION: Is this the expected result?

Thanks

Sure @kalustian - you didn't put an image into that chat, and it cut off the reply early because the maximum generation length was set to 32 tokens.
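For reference, a sketch of re-running the same chat with an image and a longer reply budget; the flags and the sample image path follow the "Automated Prompts" command earlier in the thread, and the token value here is just illustrative (a larger --max-new-tokens uses somewhat more memory):

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 128 \
    --prompt /data/images/hoover.jpg \
    --prompt 'what does the road sign say?'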

No worries, I was able to make it work - thanks a lot for your help & support.
