New VILA-1.5 multimodal vision/language models released in 3B, 8B, 13B, 40B

We’ve released new VILA models with improved accuracy and speed - up to 7.5 FPS on Orin!

These are supported in the latest 24.5 release of NanoLLM.

If you already have the nano_llm container on your system, first run docker pull dustynv/nano_llm:r36.2.0 (or r35.4.1), and then you should be able to run this along with the other VLM demos:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --prompt /data/prompts/images.json

It now also uses TensorRT to accelerate the CLIP/SigLIP vision encoder in the pipeline 👍
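For scripting beyond the CLI demo above, the container also exposes a Python API. Here's a minimal sketch, assuming the nano_llm package's NanoLLM/ChatHistory interface and an MLC backend as in the command above (the image path is a hypothetical example, and this would need to run inside the container on a Jetson):

```python
# Hedged sketch of driving VILA-1.5 from Python via nano_llm (inside the container)
from nano_llm import NanoLLM, ChatHistory

# Load the 3B VILA-1.5 model with the MLC backend, matching the CLI example
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",
    api="mlc",
)

# Build a multimodal chat turn: an image followed by a text prompt
chat_history = ChatHistory(model)
chat_history.append(role="user", image="/data/images/example.jpg")  # hypothetical path
chat_history.append(role="user", msg="Describe the image.")

# Embed the chat and generate a reply
embedding, position = chat_history.embed_chat()
reply = model.generate(embedding, streaming=False)
print(reply)
```

The same loop can be extended to stream tokens (streaming=True) or to feed video frames for the live-vision demos.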