VIA with VILA

The family of research VILA models can now be used as a locally deployed VLM for video summarization with VIA. The research VILA models are open source and publicly available on GitHub and Hugging Face, and they come in several sizes from 3B to 40B parameters. This post shows how to deploy a local VILA VLM server and configure VIA to use it for video summarization, as an alternative to GPT-4o or the built-in VITA-2.0 VLM.

To use VILA with VIA, follow these steps:

1) Prerequisites

Before starting, make sure you have:

• An NVIDIA GPU with sufficient VRAM (see the Memory Usage notes below)
• Docker with the NVIDIA Container Toolkit installed
• Access to the VIA 2.0 DP container image on NGC (nvcr.io/metropolis/via-dp/via-engine:2.0-dp)
• An NVIDIA API key from build.nvidia.com

2) Set up the VILA VLM Server

Clone the VILA GitHub repository

git clone https://github.com/NVlabs/VILA

Build VILA Server Container

cd VILA
docker build -t vila-server:latest .
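
When the build finishes, you can confirm the image is available with the standard Docker CLI:

docker image ls vila-server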

Choose one of the following to launch the VILA server with your desired model size:

Efficient-Large-Model/VILA1.5-3B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-3B \
    -e VILA_CONV_MODE=vicuna_v1 \
    vila-server:latest

Efficient-Large-Model/Llama-3-VILA1.5-8B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/Llama-3-VILA1.5-8B \
    -e VILA_CONV_MODE=llama_3 \
    vila-server:latest

Efficient-Large-Model/VILA1.5-13B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-13B \
    -e VILA_CONV_MODE=vicuna_v1 \
    vila-server:latest

Efficient-Large-Model/VILA1.5-40B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-40B \
    -e VILA_CONV_MODE=hermes-2 \
    vila-server:latest
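
Optionally, the model weights can be pre-downloaded into the ./hub directory that the commands above mount into the container, so the first launch does not need to pull them. This is a hedged example using the Hugging Face CLI; adjust the model ID to the variant you plan to run:

# Optional: pre-populate the mounted Hugging Face cache
pip install -U "huggingface_hub[cli]"
huggingface-cli download Efficient-Large-Model/VILA1.5-3B --cache-dir ./hub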

Running one of the above commands will download the VILA model (if it is not already cached) and launch an OpenAI-compatible server that can be used for VILA inference.
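
As a quick sanity check, you can send a request to the server's chat completions endpoint. This is a hedged example: the route and payload follow the standard OpenAI API that the server emulates, and VILA normally expects image inputs, so a text-only request may return a short or empty reply; any well-formed response confirms the server is reachable.

# Minimal reachability check against the assumed OpenAI-compatible route
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "VILA1.5-3B",
          "messages": [{"role": "user", "content": "Describe a busy traffic intersection."}],
          "max_tokens": 64
        }'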

3) Configure and launch VIA

Once the VILA server has successfully launched, VIA can be configured to use it instead of the built-in VITA-2.0 model or GPT-4o.

First, set the following environment variables:

VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME: Must match the VILA model name launched in Step 2, not including "Efficient-Large-Model/"
CA_RAG_CONFIG_FILE_ON_HOST: Local path to the CA-RAG config file
CA_RAG_CONFIG_FILE_IN_CONTAINER: Path to the CA-RAG config file inside the container
NVIDIA_API_KEY: API key from build.nvidia.com
BACKEND_PORT: Port for the VIA backend server. No need to change.
FRONTEND_PORT: Port for the VIA frontend server. No need to change.
OPENAI_API_KEY: A placeholder key for the locally deployed VILA server. It can be any string, since the VILA server does not require authentication.
VLM_MODEL_TO_USE: Must be set to "openai-compat"
VIA_VLM_ENDPOINT: Must be set to the VILA server endpoint (for example, http://localhost:8000)

The following sample can be used; be sure to adjust the model name, config paths, and API key for your setup.

export VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=VILA1.5-3B
export CA_RAG_CONFIG_FILE_ON_HOST=/home/config.yaml
export CA_RAG_CONFIG_FILE_IN_CONTAINER=/config.yaml
export NVIDIA_API_KEY=nvapi-***
export BACKEND_PORT=31000
export FRONTEND_PORT=31009
export OPENAI_API_KEY=fake_key
export VLM_MODEL_TO_USE=openai-compat
export VIA_VLM_ENDPOINT=http://localhost:8000
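
If you do not yet have a CA-RAG config file at the path set above, the sketch below shows roughly how the summarization prompts (listed in full in the VLM Comparison section at the end of this post) fit into it. This is an assumption-laden illustration only: the authoritative schema is the config.yaml shipped with the VIA container, so copy that file and edit its prompt fields rather than relying on the nesting shown here.

# Hedged sketch only: field nesting is assumed, prompt text is truncated
cat > /home/config.yaml <<'EOF'
summarization:
  prompts:
    caption: "Summarize this traffic crossing video, focusing on unsafe or risk events ..."
    caption_summarization: "You should summarize the following events of a traffic crossing video ..."
    summary_aggregation: "You are an intelligent traffic monitoring system trying to detect vehicles ..."
EOF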

Then launch VIA:

docker run --rm -it --ipc=host --ulimit memlock=-1 \
 --ulimit stack=67108864 --tmpfs /tmp:exec --name via-server --net="host" \
 --gpus '"device=all"' \
 -p $FRONTEND_PORT:$FRONTEND_PORT \
 -p $BACKEND_PORT:$BACKEND_PORT \
 -e BACKEND_PORT=$BACKEND_PORT \
 -e FRONTEND_PORT=$FRONTEND_PORT \
 -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
 -e OPENAI_API_KEY=$OPENAI_API_KEY \
 -e VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=$VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME \
 -e VLM_MODEL_TO_USE=$VLM_MODEL_TO_USE \
 -e VIA_VLM_ENDPOINT=$VIA_VLM_ENDPOINT \
 -e CA_RAG_CONFIG=$CA_RAG_CONFIG_FILE_IN_CONTAINER \
 -e VIA_DEV_API=1 \
 -v via-hf-cache:/tmp/huggingface \
 -v $CA_RAG_CONFIG_FILE_ON_HOST:$CA_RAG_CONFIG_FILE_IN_CONTAINER \
 nvcr.io/metropolis/via-dp/via-engine:2.0-dp

Note that --net="host" is added so the VIA container can access the locally deployed VILA server.

Once VIA is fully loaded, you can access the VIA WebUI and summarize videos with a locally deployed VILA model!
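
For a quick check from the command line (hedged: the readiness route below is an assumption; consult the VIA documentation if it differs on your version):

# Assumed readiness probe and WebUI URL
curl -s "http://localhost:${BACKEND_PORT}/health/ready"
xdg-open "http://localhost:${FRONTEND_PORT}"    # or open this URL in a browser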

Notes

Memory Usage

The peak memory usage was measured while summarizing a 2-minute video with VIA and locally deployed VILA models on an A6000 GPU (48 GB VRAM). Use these measurements to help decide which VILA model to use based on your system's VRAM.

Deployment            Peak VRAM Usage
VIA + VILA1.5-3B      12.6 GB
VIA + VILA1.5-8B      22.5 GB
VIA + VILA1.5-13B     37.85 GB
VIA + VILA1.5-40B     OOM (out of memory on 48 GB)
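
To gauge peak usage on your own system, one option is to sample GPU memory while a summarization request is running (standard nvidia-smi query flags):

# Log GPU memory usage once per second while a summarization runs
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > vram_log.csv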

VLM Comparison

To get an idea of how summarization quality is affected by the choice of VLM, the summarization results from GPT-4o, VITA-2.0, VILA1.5-3B, VILA1.5-8B, and VILA1.5-13B are shown below.

Each model was used to summarize the same traffic video with the same input prompts. The notable events in the input video are a car driving in the wrong direction and a woman stepping out of her car into traffic.

Prompts:
caption: “Summarize this traffic crossing video, focusing on unsafe or risk events such as vehicles going in the wrong direction which could cause an accident.”

caption_summarization: “You should summarize the following events of a traffic crossing video in the format start_time:end_time:caption. If during a time segment only regular activities happen, then ignore them, else note any irregular activities like vehicles going in the wrong opposite direction in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don’t return anything else except the bullet points.”

summary_aggregation: “You are an intelligent traffic monitoring system trying to detect vehicles that are going in the wrong direction. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. The output should only contain bullet points.”

Results:

GPT-4o

Here are the aggregated captions in the format start_time:end_time:event_description:
• 15.0:16.0: Person exits a car in the middle of traffic.
• 17.099:19.099: Person walks away from the car, standing in front of it, posing a hazard to oncoming traffic.
• 20.099:39.099: Person stands in front of a white car in the left lane, walks towards the left side of the road, and interacts with the driver, posing a risk to traffic flow.
• 40.099:45.103: Person walks in the middle of the road between lanes of traffic, posing a potential hazard to themselves and the vehicles.

VITA-2.0

Here are the aggregated captions in the format start_time:end_time:event_description:
• 0.1:2.7:Wrong-way vehicle detected
• 2.8:9.8:Multiple wrong-way vehicles detected
• 10.1:12.5:Wrong-way vehicle detected in traffic
• 30.2:32.8:Wrong-way vehicle and large truck detected
• 32.8:40.0:Accident involving wrong-way vehicle and large truck

VILA1.5-3B

• 10.0:10.0: Vehicle traveling in wrong direction

VILA1.5-8B

Here are the aggregated captions in the format start_time:end_time:event_description:
• 0.00:10.00: Vehicles going in the wrong direction.
• 10.00:30.00: Vehicles going in the wrong direction, risking an accident.
• 40.00:45.60: Vehicles going in the wrong direction, person standing in the middle of the road, dangerous situation.

VILA1.5-13B

Here are the aggregated captions in the format start_time:end_time:event_description:
• 10.00:40.00: Vehicle traveling in wrong direction
• 50.00:70.80: Vehicle traveling in wrong direction, almost causing accidents

In general, the larger the model, the higher the quality of the output. VILA1.5-13B is a good alternative to VITA-2.0 and GPT-4o. VILA1.5-40B is also a great option for high-quality output, but it requires significant compute. The 3B and 8B models may hallucinate or miss some details in the video.
