under the " Multimodal Chat" option , I copy/pasted the commands:
jetson-containers run $(autotag nano_llm)
python3 -m nano_llm.chat --api=mlc
–model Efficient-Large-Model/VILA1.5-3b
–max-context-len 256
–max-new-tokens 32
After a few minutes I got the result below:
dustynv/nano_llm:24.5.1-r36.2.0
- sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48990.07it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 45473.96it/s]
18:21:39 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
18:21:40 | INFO | running MLC quantization:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors
Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param:   2%|█▏        | 3/197 [00:02<02:25, 1.33tensors/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$
I was expecting the interactive, console-based chat with Llava to start, but it did not.
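My guess (unverified) is that the mlc_llm.build quantization step is being killed by the kernel's out-of-memory killer, since a SIGKILL a few tensors into weight quantization usually means the process ran out of RAM rather than crashed on its own. To check that assumption, I plan to look at the kernel log and available memory right after the failure:

sudo dmesg | grep -i -E 'out of memory|killed process'   # look for an oom-killer entry naming python3
free -h                                                   # see how much RAM and swap are actually available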
I also tried the "Automated Prompts" option, and here is what I see in the terminal:
Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.3.0 JETPACK_VERSION=6.0 CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
[sudo] password for kalustian:
dustynv/nano_llm:24.5.1-r36.2.0
- sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/kalustian/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock dustynv/nano_llm:24.5.1-r36.2.0 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32 --prompt /data/images/hoover.jpg --prompt 'what does the road sign say?' --prompt 'what kind of environment is it?' --prompt reset --prompt /data/images/lake.jpg --prompt 'please describe the scene.' --prompt 'are there any hazards to be aware of?'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 48167.80it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 36944.65it/s]
18:30:01 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
18:30:03 | INFO | running MLC quantization:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors
Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|█         | 2/197 [00:02<03:52, 1.19s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
kalustian@ubuntu:~$
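Both runs die at the same point, only a couple of tensors into the weight quantization, which also fits the out-of-memory theory above. If that does turn out to be the cause, my tentative plan (a sketch based on the usual Jetson swap-setup steps; the path and size are just my assumptions, and I have not tried this yet) is to add a larger swap file before retrying:

sudo systemctl disable nvzramconfig           # assumption: trade zram for a bigger disk-backed swap file
sudo fallocate -l 16G /ssd/16GB.swap          # hypothetical path/size; put it wherever there is room
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap
free -h                                        # confirm the new swap is active before rerunning the chat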