MiniGPT-4 on Jetson Orin Nano 8GB Dev Kit not working

Hello, not sure if this is the right group. I am trying to run MiniGPT-4 from the tutorial https://www.jetson-ai-lab.com/tutorial_minigpt4.html.

I have installed Docker and run the command listed in the tutorial. It looks like it downloaded all the files and then nothing happened; it just returns to the command line, as seen below:

ggml_init_cublas: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7
llama.cpp: loading model from /data/models/huggingface/datasets--maknee--ggml-vicuna-v0-quantized/snapshots/1d8789f34eb803bf52daf895c7ecfd2559cf5ccc/ggml-vicuna-13B-v0-q5_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 10959 MB
kalustian@ubuntu:~$

NOTE: I was able to successfully run the "Tutorial - text-generation-webui" tutorial, but not the one above.

Any idea on what the issue might be?

Thanks

Hi @kalustian, it would have run out of memory on the Orin Nano 8GB; apparently there were updates to the model or llama.cpp that caused this. I will update the docs to reflect that, sorry about that. On Orin Nano, I would use the VILA-2.7b or VILA1.5-3b multimodal models:

I am still trying to use and learn the Jetson, sorry for the question… what terminal commands should I use to download and run VILA-2.7b or VILA1.5-3b?

@kalustian, see the NanoVLM page I linked to above for the commands.

under the " Multimodal Chat" option , I copy/pasted the commands:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

After a few minutes I got the result below:

dustynv/nano_llm:24.5.1-r36.2.0

  • sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
    warnings.warn(
    Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48990.07it/s]
    Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 45473.96it/s]
    18:21:39 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
    18:21:40 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors

Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights… This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 2%|█▏ | 3/197 [00:02<02:25, 1.33tensors/s]Traceback (most recent call last): | 1/327 [00:02<15:30, 2.85s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
model = NanoLLM.from_pretrained(
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$

I was expecting the interactive console-based chat with Llava to start… well, it looks like it didn't.

I also tried the " Automated Prompts" option…and here is what I see in the terminal:

Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.3.0 JETPACK_VERSION=6.0 CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
[sudo] password for kalustian:
dustynv/nano_llm:24.5.1-r36.2.0

  • sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32 --prompt /data/images/hoover.jpg --prompt 'what does the road sign say?' --prompt 'what kind of environment is it?' --prompt reset --prompt /data/images/lake.jpg --prompt 'please describe the scene.' --prompt 'are there any hazards to be aware of?'
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    Fetching 13 files: 0%| | 0/13 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
    warnings.warn(
    Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48167.80it/s]
    Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 36944.65it/s]
    18:30:01 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
    18:30:03 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors

Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights… This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 1%|█ | 2/197 [00:02<03:52, 1.19s/tensors]Traceback (most recent call last): | 1/327 [00:02<15:28, 2.85s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
model = NanoLLM.from_pretrained(
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$

@kalustian, your board keeps running out of memory on processes that other Orin Nano users have been able to run (perhaps this is related to why you were unable to run the MiniGPT-4 example too). Can you try mounting swap, disabling ZRAM, and if necessary disabling the desktop UI, as shown here:
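For anyone following along, here is a rough sketch of those memory optimizations. The 16GB size and the /ssd swap location are assumptions to adjust for your own storage; on JetPack, ZRAM is provided by the nvzramconfig service.

# Confirm the SIGKILL came from the kernel OOM killer
sudo dmesg | grep -i -e "killed process" -e "out of memory"

# Disable ZRAM so compressed swap stops competing for RAM
sudo systemctl disable nvzramconfig

# Create and enable a swap file (size and path are assumptions)
sudo fallocate -l 16G /ssd/16GB.swap
sudo chmod 600 /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap

# Make the swap persistent across reboots
echo "/ssd/16GB.swap  none  swap  sw 0  0" | sudo tee -a /etc/fstab

# Optionally boot to console instead of the desktop UI to free more RAM
# (revert later with: sudo systemctl set-default graphical.target)
sudo systemctl set-default multi-user.target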

Eureka moment?

Well, I started again from ground zero, re-imaged the NVMe drive, and applied your suggested RAM optimization settings. Now, when I copy/paste and run the commands under the "Multimodal Chat" option:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

I got the results below in the terminal:

PROMPT: Who was the First successful ascent to the summit of Mount Everest? At what time they reached the summit?

The first successful ascent to the summit of Mount Everest was made by a team of three climbers:

  1. Sir Edmund Hill

┌───────────────┬─────────────┐
│ embed_time │ 0.000218919 │
├───────────────┼─────────────┤
│ input_tokens │ 29 │
├───────────────┼─────────────┤
│ output_tokens │ 32 │
├───────────────┼─────────────┤
│ prefill_time │ 0.0175087 │
├───────────────┼─────────────┤
│ prefill_rate │ 1656.32 │
├───────────────┼─────────────┤
│ decode_time │ 1.07084 │
├───────────────┼─────────────┤
│ decode_rate │ 29.883 │
└───────────────┴─────────────┘

QUESTION: Is this the expected result?

Thanks

Sure @kalustian - you didn't put an image into that chat, and it cut off the reply early because the maximum generation length was set to 32 tokens.
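For reference, a sketch of re-running the same chat with an image and a longer reply budget; the flags and the sample image path follow the "Automated Prompts" command earlier in the thread, and the token value here is just illustrative (a larger --max-new-tokens uses somewhat more memory):

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 128 \
    --prompt /data/images/hoover.jpg \
    --prompt 'what does the road sign say?'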

No worries, I was able to make it work - thanks a lot for your help & support.
