Hi, recently I tried the new MiniCPM-Llama3-V-2_5 with int4 quantization on a Jetson AGX Orin using Hugging Face transformers. It runs on a continuous video stream, performs decently in terms of speed, and uses only ~5GB of RAM. Is there any way to optimize the inference for better FPS (e.g. a TensorRT engine)? Below is my modified script, based on video.py from @dusty_nv's NanoLLM containers. You need to install transformers from Hugging Face and build bitsandbytes from source (Installation). Everything was done inside the containers.
GitHub - AnielAlexa/NanoLLM: Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
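For anyone who wants to reproduce the baseline, here is a minimal sketch of the transformers + bitsandbytes int4 load (the chat() call follows the MiniCPM-V model card; the image path, prompt, and device settings are placeholders, and the actual modified video.py is in the repo linked above):

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "openbmb/MiniCPM-Llama3-V-2_5"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,          # custom modeling_minicpmv.py code
    quantization_config=bnb_config,
    device_map="cuda:0",
).eval()

image = Image.open("frame.jpg").convert("RGB")   # placeholder frame from the video stream
msgs = [{"role": "user", "content": "Describe what you see."}]
print(model.chat(image=image, msgs=msgs, tokenizer=tokenizer))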
Hi @alexaaniel, cool! If you are able to get it loading through NanoLLM with --api mlc --quantization q4f16_ft, that should significantly speed it up (bitsandbytes quantization is slow).
You will want to step through the NanoLLM.config_vision() and NanoLLM.init_vision() functions to make sure it can load the model. It will attempt to load the VLM's LLM with MLC/AWQ, and load the VLM's vision encoder with TensorRT.
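In Python, that loading path looks roughly like this (a sketch following the NanoLLM API; config_vision()/init_vision() are called internally by from_pretrained(), and the MiniCPM model ID is the part that would need the new arch handling):

from nano_llm import NanoLLM

# from_pretrained() walks through config_vision()/init_vision() for VLM architectures,
# quantizing the LLM with MLC (q4f16_ft) and building the vision encoder with TensorRT
model = NanoLLM.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",   # assumes the 'minicpmv' arch handling discussed below
    api="mlc",
    quantization="q4f16_ft",
)

for token in model.generate("Once upon a time,", max_new_tokens=48):
    print(token, end="", flush=True)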
If you can get it working and tested with nano_llm.chat and nano_llm.vision.example, would be happy to accept a PR for it! Thanks and good luck!
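The test commands would look something like this (a sketch; the flags are mirrored from nano_llm.chat, so check --help for the vision example's exact arguments):

python3 -m nano_llm.chat --api=mlc --quantization=q4f16_ft --model openbmb/MiniCPM-Llama3-V-2_5
python3 -m nano_llm.vision.example --api=mlc --quantization=q4f16_ft --model openbmb/MiniCPM-Llama3-V-2_5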
Thank you for your answer! Following it, I have modified config_vision() so that MLC can quantize the LLM part. However, in MiniCPM-Llama3-V-2_5 the weight parameters of the LLM and SigLIP are stored together, not separately. May I ask how to modify init_vision()? openbmb/MiniCPM-Llama3-V-2_5 · Hugging Face
@john_c maybe load the combined weights and then filter them by examining the parameter names and/or specific markers that distinguish the LLM parameters from the SigLIP ones.
Maybe this helps: modeling_minicpmv.py · openbmb/MiniCPM-Llama3-V-2_5 at main
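A rough sketch of that split (only the 'llm.' prefix is confirmed by the rename used later in this thread; the 'vpm.'/'resampler.' prefixes for the vision side are assumptions from the modeling file, and the real checkpoint is sharded safetensors rather than a single .bin):

import torch

# load the combined MiniCPM-Llama3-V-2_5 checkpoint and split it by parameter prefix
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

llm_weights = {k.replace("llm.", "", 1): v for k, v in state_dict.items() if k.startswith("llm.")}
vision_weights = {k: v for k, v in state_dict.items() if k.startswith(("vpm.", "resampler."))}

torch.save(llm_weights, "llm/pytorch_model.bin")        # Llama-3 half, fed to MLC
torch.save(vision_weights, "vision/pytorch_model.bin")  # SigLIP encoder + resampler, for TensorRT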
Thank you, I will try this method.
@john_c if you want, you can share your work so we can speed things up. I'm also working on this now. Thanks!
@john_c Did you change the config to quantize it as llama?
Yes, I have modified config_vision() like this:
if arch == 'minicpmv':
    llm_path = os.path.join(self.model_path, 'llm')
    llm_config = os.path.join(llm_path, 'config.json')
    print("minicpmv arch")

    if not os.path.isdir(llm_path):
        os.makedirs(llm_path)

    # patch in a llama-style text config so MLC treats the LLM part as a plain Llama-3 model
    text_config = {'model_type': 'llama', 'torch_dtype': 'float16', 'vocab_size': 128256}
    self.patch_config(load=download_model(os.path.join('unsloth/llama-3-8b-Instruct', 'config.json')),
                      save=llm_config, **text_config)

    with open(llm_config) as f:
        print("json file path: ", llm_config)
        json_file = json.load(f)
        print(json_file)

    json_file = filter_keys(json_file, keep=['max_position_embeddings', 'vocab_size'])
    print(json_file)
    self.patch_config(**json_file)

    # copy the tokenizer files into the extracted llm/ directory
    print(self.model_path)
    for tokenizer in glob.glob(os.path.join(self.model_path, 'tokenizer*')):
        print(tokenizer)
        shutil.copy(tokenizer, llm_path)

    # copy modeling_minicpmv.py, configuration_minicpmv.py, resampler.py
    # strip the 'llm.' prefix so the weights look like a standalone Llama checkpoint
    rename_weights(self.model_path, llm_path, lambda layer: layer.replace("llm.", ''))
    self.model_path = llm_path
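In case a rename_weights() helper needs to be written for this, a rough sketch over safetensors shards might look like the following (illustrative only; the sharded .index.json weight map would also need the same renaming):

import os, glob
from safetensors.torch import load_file, save_file

def rename_weights(src_dir, dst_dir, rename):
    # copy every checkpoint shard from src_dir to dst_dir, renaming each tensor with rename(name)
    for shard in glob.glob(os.path.join(src_dir, '*.safetensors')):
        tensors = load_file(shard)
        save_file({rename(name): t for name, t in tensors.items()},
                  os.path.join(dst_dir, os.path.basename(shard)))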
This generated the llm-q4f16_ft-cuda.so file under /data/models/mlc/dist/-ctx8192/llm-q4f16_ft/. The build command looks like this:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/MiniCPM-Llama3-V-2_5 --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/MiniCPM-Llama3-V-2_5-ctx8192 --use-safetensors
The --model path should be /data/models/mlc/dist/models/MiniCPM-Llama3-V-2_5/llm, but I have no way to set it, so it tries to quantize with the original config (which is not llama).
I managed to quantize the model manually. Thank you!
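For anyone following along, the manual build is essentially the same mlc_llm.build command as above, just pointed at the extracted llm/ directory (a sketch):

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/MiniCPM-Llama3-V-2_5/llm --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/MiniCPM-Llama3-V-2_5-ctx8192 --use-safetensors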
@dusty_nv could it be an option to try llama.cpp, like in the video with Gemma 2? How big is the difference between that approach and MLC? I managed to quantize the LLM part (it works and I did some text inference) and add their custom vision encoder, but the embedding tensors do not match.
@alexaaniel IMO llama.cpp support for VLMs is hit-or-miss and will require you to go through a similar process to quantize them (unless it is already supported by llama.cpp)
…and actually, the MiniCPM model card does call out llama.cpp support - you might just be in luck! llama.cpp/examples/minicpmv/README.md at minicpm-v2.5 · OpenBMB/llama.cpp · GitHub
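Running that fork would go roughly along these lines (a sketch only; check the README linked above for the exact binary name, GGUF conversion steps, and flags - the paths here are placeholders):

./minicpmv-cli -m ./MiniCPM-Llama3-V-2_5/ggml-model-Q4_K_M.gguf \
    --mmproj ./MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf \
    -c 4096 --image ./frame.jpg -p "What is in the image?"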
So yes, if you are still having problems, try that - while llama.cpp may not be as fast as MLC/TVM or AWQ, it does have broad adoption. And while I wish I could personally support more of these awesome VLMs, priority goes to the ones that bring brand-new vision capabilities versus incremental improvements over Llava. Unfortunately they all seem to require some degree of massaging to get quantized, since there is no standard "pipeline" for how they chain together their vision encoders and LLMs.