Failed to MLC-compile mlc-ai/Llama-3.1-8B-Instruct-fp8-MLC on Jetson AGX Orin

I’ve tried using mlc-ai/Llama-3.1-8B-Instruct-fp8-MLC on a Jetson AGX Orin, but it fails at the compilation step.

I’ve raised an issue with the MLC community, but I’m posting it here as well in the hope of getting some insight from the Jetson side.

Hi,
Here are some suggestions for the common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Installation

Installation guides for deep learning frameworks on Jetson:

3. Tutorial

Getting-started deep learning tutorials:

4. Report issue

If these suggestions don’t help and you want to report an issue to us, please share the model, the commands/steps, and any customized app so we can reproduce it locally.

Thanks!

Hi,

Please find below a sample for running MLC on the Orin.

Thanks.

I’ve verified that it works with quantization=q4f16_ft but not with 8-bit quantization methods.

It fails at the compilation step.

Hi,

It looks like the model can work with q4 but fails with 8-bit quantization.
If so, the failure might be caused by the 8-bit model requiring more memory than the Jetson device has available.
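As a rough back-of-the-envelope check (an illustrative estimate only — it counts weights alone and ignores the KV cache, activations, and compile-time buffers), an 8-billion-parameter model needs about twice the weight memory at 8-bit as at 4-bit:

```python
# Rough weight-size estimate for an 8B-parameter model.
# Assumption: weights only; KV cache and compilation buffers add more on top.
params = 8e9
fp8_gb = params * 1.0 / 1e9   # ~1 byte per parameter for 8-bit
q4_gb = params * 0.5 / 1e9    # ~0.5 byte per parameter for 4-bit
print(f"8-bit weights ~{fp8_gb:.0f} GB, 4-bit weights ~{q4_gb:.0f} GB")
# 8-bit weights ~8 GB, 4-bit weights ~4 GB
```

This is why a model that fits at 4-bit can still exhaust memory at 8-bit once the additional runtime buffers are included.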

You can verify this by monitoring the system with tegrastats:

$ sudo tegrastats

Thanks.