I’m performing a simple binary classification task using the VILA1.5-3b model on a Jetson Orin Nano 8GB via nano_llm (MLC API). The goal is to force a definitive ‘YES’ or ‘NO’ response, but the model fails to adhere to the format.
Problem Description
Despite using a clear system prompt and a prompt anchor (… Answer: ), the model consistently outputs non-text data or incorrectly formatted responses, for example:
- Immediate termination (empty output)
- Raw token IDs: 1 or 0
- List markers: -
The model struggles to generate the required two-to-three character string (YES/NO).
Environment and Command
Hardware: Jetson Orin Nano 8GB
Model: Efficient-Large-Model/VILA1.5-3b
API: mlc
Current Command:
python3 -m nano_llm.chat --api=mlc \
--model Efficient-Large-Model/VILA1.5-3b \
--quantization q4f16_ft \
--max-context-len 256 \
--max-new-tokens 16 \
--vision-scaling resize \
--system-prompt "You are an expert vision model. Respond to the user's question with only the English word 'YES' or 'NO'." \
--prompt '/data/images/sample.jpg' \
--prompt 'Is there an object placed in front of the cardboard divider in this image? Answer: '
Questions
Fixed Output Format: How can we ensure the VILA1.5-3b model, when run with MLC/nano_llm, reliably generates only the string ‘YES’ or ‘NO’ and nothing else, preventing the premature termination and unexpected token output?
–vision-scaling Default: The default setting for --vision-scaling is crop. Specifically, what cropping method (e.g., center crop, random crop) is implemented when this default is used?
This is related to the prompt.
It’s not guaranteed that the output will always be ‘YES’ or ‘NO’.
But you can add a simple checker to validate the output. If the check fails, run the inference again to get a new output.
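A minimal sketch of such a checker with a retry loop in plain Python (the cleanup rules and the retry count are assumptions, not part of nano_llm):

```python
from typing import Callable, Optional

VALID = {"YES", "NO"}

def check_answer(raw: str) -> Optional[str]:
    """Return 'YES' or 'NO' if the raw model output is valid, else None."""
    cleaned = raw.strip().strip(".!'\"").upper()
    return cleaned if cleaned in VALID else None

def classify(run_inference: Callable[[], str], max_retries: int = 3) -> Optional[str]:
    """Re-run inference until the output passes the checker (assumed retry policy)."""
    for _ in range(max_retries):
        answer = check_answer(run_inference())
        if answer is not None:
            return answer
    return None  # caller decides how to handle persistent failures
```

For instance, `classify(lambda: "yes.\n")` would normalize the raw generation to "YES", while outputs like "1" or "-" would trigger a retry.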
Could you share which document mentions the vision-scaling?
Thank you for your response. Based on your reply, I have summarized my understanding and follow-up questions below.
1. Regarding the Output Response
I understand that strictly limiting the model output to just ‘YES’ or ‘NO’ is inherently difficult.
For inputs where the correct answer should have been ‘YES’, the model occasionally responded with ‘1’ instead. I suspect this might be due to the reason you mentioned. (I haven’t yet observed the inverse: inputs that should be ‘NO’ responding with ‘0’.)
Also, regarding the proposed “simpler checker,” are you suggesting that we implement an external converter to change outputs like ‘0’, ‘False’, or ‘NG’ to ‘NO’, and ‘1’, ‘True’, or ‘OK’ to ‘YES’?
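If that is the idea, a converter along these lines might work (the token sets below are assumptions based on the outputs mentioned in this thread; extend them with whatever variants you actually observe):

```python
# Hypothetical mapping of observed raw outputs onto the canonical labels.
YES_TOKENS = {"1", "TRUE", "OK", "YES"}
NO_TOKENS = {"0", "FALSE", "NG", "NO"}

def normalize(raw: str):
    """Map a raw model output to 'YES'/'NO'; return None for anything unrecognized."""
    token = raw.strip().upper()
    if token in YES_TOKENS:
        return "YES"
    if token in NO_TOKENS:
        return "NO"
    return None
```

Unrecognized outputs (e.g. a bare list marker) return None, which could then feed the re-run logic suggested above.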
2. Regarding --vision-scaling
I found the --vision-scaling argument in the documentation below:
According to this documentation, it appears the default is resize, and choosing crop results in a center-crop.
I was looking through the arguments because I suspected that if the regions removed by cropping are lost from the feature map, the model might fail to make a correct judgment.
Are there any other effective arguments that you recommend I review or change to address this problem?