VILA1.5-3b quantization failed

I am attempting to run the example:
python3 -m nano_llm.chat --api=mlc --model /data/models/VILA1.5-3b --max-context-len 256 --max-new-tokens 32

I got the following output:

06:05:58 | INFO | loading /data/models/VILA1.5-3b with MLC
06:05:59 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors 


Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                              | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.
Get old param:   1%|▏                     | 2/197 [00:03<06:00,  1.85s/tensors]
Set new param:   0%|                      | 1/327 [00:03<20:20,  3.74s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/core.py", line 884, in build_model_from_args
    params = utils.convert_weights(mod_transform, param_manager, params, args)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/utils.py", line 286, in convert_weights
    vm["transform_params"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.8/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/utils.py", line 46, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/relax_model/param_manager.py", line 599, in get_item
    load_torch_params_from_bin(torch_binname)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/relax_model/param_manager.py", line 557, in load_torch_params_from_bin
    torch_params = self.safetensors_load_func(torch_binpath)
  File "/usr/local/lib/python3.8/dist-packages/safetensors/torch.py", line 311, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 30, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 277, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256 --use-safetensors ' returned non-zero exit status 1.

I have tried disabling ZRAM, mounting additional swap, and passing --vision-api=hf, but it does not work.

Hi @1147825709, it seems like something may have gotten corrupted in your model download - can you try removing it and then running the command again?

sudo rm -rf jetson-containers/data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b
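
For reference, the HeaderTooLarge error from safetensors usually means the file on disk is not valid safetensors data at all (for example, a truncated download, or a Git LFS pointer file saved in place of the real weights), which is why a fresh download tends to fix it. Here is a minimal sketch you could use to check each shard before re-quantizing; the model path is taken from your log, so adjust it for your setup:

# Minimal sketch: try to open every .safetensors shard in the model
# directory and report any that fail to deserialize. The path below is
# taken from the log above; adjust it for your setup.
from pathlib import Path
from safetensors import safe_open

model_dir = Path("/data/models/mlc/dist/models/VILA1.5-3b")

for shard in sorted(model_dir.rglob("*.safetensors")):
    try:
        # safe_open parses the header without loading every tensor,
        # so a corrupted shard fails here (e.g. with HeaderTooLarge)
        with safe_open(str(shard), framework="pt") as f:
            num_tensors = len(list(f.keys()))
        print(f"OK   {shard.name}: {num_tensors} tensors")
    except Exception as exc:
        print(f"BAD  {shard.name}: {exc}")

If any shard reports BAD, deleting the cached download as above and re-running the chat command should fetch and quantize it again.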

Yes, I solved the problem, thank you.
