I’m encountering an issue while running the VideoQuery example on my Jetson Orin NX (8GB RAM, 500GB SSD). The process fails during the MLC quantization step. I’m running inside the Docker container that jetson-containers selects automatically (autotag).
Setup:
Jetson Orin NX 8GB with 500GB SSD
JetPack 6 and L4T 36.4.0
Running in a Docker container using jetson-containers
Command executed:
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --nanodb /data/nanodb/coco/2017
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 35429.47it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 110035.75it/s]
16:01:44 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/42d1dda6807cc521ef27674ca2ae157539d17026 with MLC
16:01:48 | INFO | NumExpr defaulting to 6 threads.
16:01:48 | WARNING | AWQ not installed (requires JetPack 6 / L4T R36) - AWQ models will fail to initialize
['/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/mlc-chat-config.json', '/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/params/mlc-chat-config.json']
16:01:50 | INFO | running MLC quantization: python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors
Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|▉         | 2/197 [00:02<03:22, 1.04s/tensors] | 1/327 [00:02<13:23, 2.47s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 357, in <module>
    agent = VideoQuery(**vars(args)).run()
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 44, in __init__
    self.llm = ChatQuery(model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs) #ProcessProxy('ChatQuery', model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 78, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
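Since the quantization subprocess died with SIGKILL rather than a Python exception, my guess is that the kernel OOM killer terminated it on this 8GB board, but I have not confirmed that yet. This is the kind of check I would run on the host (outside the container) to verify it and to see how much swap is mounted; the grep pattern is just my assumption of what the relevant kernel log entries look like:

# look for OOM-killer activity in the kernel log
sudo dmesg | grep -iE "out of memory|killed process"

# show current memory usage and mounted swap
free -h
swapon --show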
Any help or suggestions would be appreciated!