Error when following "NanoVLM - Efficient Multimodal Pipeline"

Hello. I got an “Exception” error while following your tutorial on Small Language Models (SLM) with the Jetson Orin Nano (8GB, with a 128GB SD card)
(NanoVLM - NVIDIA Jetson AI Lab).
When I ran the first command from the tutorial, the screen froze for a few minutes.

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

I received the following error message.

seongkyu@ubuntu:~$ jetson-containers run $(autotag nano_llm) \
>   python3 -m nano_llm.chat --api=mlc \
>     --model Efficient-Large-Model/VILA1.5-3b \
>     --max-context-len 256 \
>     --max-new-tokens 32
Namespace(disable=[''], output='/tmp/autotag', packages=['nano_llm'], prefer=['local', 'registry', 'build'], quiet=False, user='dustynv', verbose=False)
-- L4T_VERSION=35.5.0  JETPACK_VERSION=5.1  CUDA_VERSION=11.4
-- Finding compatible container image for ['nano_llm']
dustynv/nano_llm:r35.4.1
[sudo] password for seongkyu:
localuser:root being added to access control list
+ docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/seongkyu/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/video0 --device /dev/video1 --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 dustynv/nano_llm:r35.4.1 python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 122.77it/s]
Fetching 17 files:   0%|                                                                                                                                                            | 0/17 [00:00<?, ?it/s]
llm/model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [1:12:51<00:00, 721kB/s]
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [1:12:51<00:00, 257.17s/it]
09:01:17 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
09:01:20 | INFO | backing up original model config to /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc/config.json.backup
09:01:20 | INFO | patching model config with {'model_type': 'llama'}
09:01:20 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256


Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/core.py", line 834, in build_model_from_args
    mod, param_manager, params, model_config = model_generators[args.model_category].get_model(
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/relax_model/llama.py", line 1333, in get_model
    raise Exception(
Exception: The model config should contain information about maximum sequence length.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 29, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 ' returned non-zero exit status 1.
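
From the traceback, my guess is that the MLC build step fails because the config.json it reads does not contain a maximum sequence length field ("The model config should contain information about maximum sequence length"). As a possible workaround (only my guess, not something from the tutorial), would manually adding such a field to the config that the build command points at be reasonable? Here is a minimal sketch, assuming the field names mlc_llm looks for are max_sequence_length / max_position_embeddings and using the --model path from the build command in the log:

import json

# My guess at a workaround (not from the tutorial): add a maximum sequence length
# to the config.json that mlc_llm.build reads. The path below is the --model path
# from the log (inside the container); on the host, /data maps to
# /home/seongkyu/jetson-containers/data per the docker run command above.
config_path = "/data/models/mlc/dist/models/VILA1.5-3b/config.json"

with open(config_path) as f:
    config = json.load(f)

# 2048 is just a guess at a sensible context length for this model.
config.setdefault("max_sequence_length", 2048)
config.setdefault("max_position_embeddings", 2048)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

After patching, I would re-run the jetson-containers command above, but I am not sure this is the right fix.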

How can I fix this error so I can follow your NanoVLM tutorial successfully?

Closing this as duplicate of: