Error during quantization step in VideoQuery example on Jetson Orin NX

I’m encountering an issue while running the VideoQuery example on my Jetson Orin NX (8GB RAM, 500GB SSD): the process fails during the quantization step. I’m running inside the Docker container that jetson-containers selects automatically via autotag.

Setup:

Jetson Orin NX 8GB with 500GB SSD
JetPack 6 (L4T R36.4.0)
Running in a Docker container using jetson-containers

Command executed:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --nanodb /data/nanodb/coco/2017

Output:
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 35429.47it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 110035.75it/s]
16:01:44 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/42d1dda6807cc521ef27674ca2ae157539d17026 with MLC
16:01:48 | INFO | NumExpr defaulting to 6 threads.
16:01:48 | WARNING | AWQ not installed (requires JetPack 6 / L4T R36) - AWQ models will fail to initialize
['/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/mlc-chat-config.json', '/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/params/mlc-chat-config.json']
16:01:50 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors

Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.
  0%|          | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|▉         | 2/197 [00:02<03:22, 1.04s/tensors]
  0%|          | 1/327 [00:02<13:23, 2.47s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 357, in <module>
    agent = VideoQuery(**vars(args)).run()
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 44, in __init__
    self.llm = ChatQuery(model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs)  #ProcessProxy('ChatQuery', model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 78, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors' died with <Signals.SIGKILL: 9>.

Any help or suggestions would be appreciated!

Hi,

Signals.SIGKILL at this stage is usually triggered by the kernel’s out-of-memory (OOM) killer.
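You can confirm it in the kernel log after the crash, and watch memory usage live while the build runs:

# check whether the kernel OOM killer terminated the build
sudo dmesg | grep -i -E "out of memory|killed process"

# report live memory usage on Jetson while the quantization runs
sudo tegrastats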
Could you follow the steps below for RAM optimization?
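For reference, a minimal sketch of the usual RAM-optimization steps, assuming your NVMe SSD is mounted at /ssd (the mount point and the 16GB swap size are assumptions; adjust them for your setup):

# disable the default zram swap so it no longer competes for physical RAM
sudo systemctl disable nvzramconfig

# create and enable a swap file on the SSD
sudo fallocate -l 16G /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap

# boot to the console instead of the desktop GUI to free additional RAM
sudo systemctl set-default multi-user.target
sudo reboot

With swap in place, the quantization step should be able to page through the weights even on the 8GB module.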

Thanks.

Hi AastaLL,

Thank you very much for your guidance! I followed the steps, and it’s working perfectly now.

Good to know it works now!
Thanks for the update.
