VIA with VILA

The family of research VILA models can now be used as a locally deployed VLM for video summarization with VIA. The research VILA models are open source and publicly available on GitHub and Hugging Face, and they come in several sizes from 3B to 40B parameters. This post shows how to deploy a local VILA VLM server and configure VIA to use it for video summarization, as an alternative to GPT-4o or the built-in VITA-2.0 VLM.

To use VILA with VIA, follow these steps:

1) Prerequisites

Before starting, make sure you have:

• An NVIDIA GPU with sufficient VRAM (see the Memory Usage notes below)
• Docker with the NVIDIA Container Toolkit installed
• Access to the VIA 2.0 DP container image on NGC (nvcr.io/metropolis/via-dp/via-engine:2.0-dp)
• An NVIDIA API key from build.nvidia.com

2) Set up the VILA VLM Server

Clone the VILA GitHub repository

git clone https://github.com/NVlabs/VILA

Build VILA Server Container

cd VILA
docker build -t vila-server:latest .
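
When the build finishes, you can confirm the image is available with the standard Docker CLI:

docker image ls vila-server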

Choose one of the following to launch the VILA server with your desired model size:

Efficient-Large-Model/VILA1.5-3B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-3B \
    -e VILA_CONV_MODE=vicuna_v1 \
    vila-server:latest

Efficient-Large-Model/Llama-3-VILA1.5-8B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/Llama-3-VILA1.5-8B \
    -e VILA_CONV_MODE=llama_3 \
    vila-server:latest

Efficient-Large-Model/VILA1.5-13B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-13B \
    -e VILA_CONV_MODE=vicuna_v1 \
    vila-server:latest

Efficient-Large-Model/VILA1.5-40B

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ./hub:/root/.cache/huggingface/hub \
    -it --rm -p 8000:8000 \
    -e VILA_MODEL_PATH=Efficient-Large-Model/VILA1.5-40B \
    -e VILA_CONV_MODE=hermes-2 \
    vila-server:latest
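
Optionally, the model weights can be pre-downloaded into the ./hub directory that the commands above mount into the container, so the first launch does not need to pull them. This is a hedged example using the Hugging Face CLI; adjust the model ID to the variant you plan to run:

# Optional: pre-populate the mounted Hugging Face cache
pip install -U "huggingface_hub[cli]"
huggingface-cli download Efficient-Large-Model/VILA1.5-3B --cache-dir ./hub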

Running one of the above commands will download the VILA model (if it is not already cached) and launch an OpenAI-compatible server that can be used for VILA inference.
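
As a quick sanity check, you can send a request to the server's chat completions endpoint. This is a hedged example: the route and payload follow the standard OpenAI API that the server emulates, and VILA normally expects image inputs, so a text-only request may return a short or empty reply; any well-formed response confirms the server is reachable.

# Minimal reachability check against the assumed OpenAI-compatible route
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "VILA1.5-3B",
          "messages": [{"role": "user", "content": "Describe a busy traffic intersection."}],
          "max_tokens": 64
        }'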

3) Configure and launch VIA

Once the VILA server has successfully launched, VIA can be configured to use it instead of the built-in VITA-2.0 model or GPT-4o.

First, set the following environment variables:

VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME: Must match the VILA model name launched in Step 2, not including "Efficient-Large-Model/"
CA_RAG_CONFIG_FILE_ON_HOST: Local path to the CA-RAG config file
CA_RAG_CONFIG_FILE_IN_CONTAINER: Path to the CA-RAG config file inside the container
NVIDIA_API_KEY: API key from build.nvidia.com
BACKEND_PORT: Port for the VIA backend server. No need to change.
FRONTEND_PORT: Port for the VIA frontend server. No need to change.
OPENAI_API_KEY: A placeholder key for the locally deployed VILA server. It can be any string, since the VILA server does not require authentication.
VLM_MODEL_TO_USE: Must be set to "openai-compat"
VIA_VLM_ENDPOINT: Must be set to the VILA server endpoint (for example, http://localhost:8000)

The following sample can be used; be sure to adjust the model name, config paths, and API key for your setup.

export VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=VILA1.5-3B
export CA_RAG_CONFIG_FILE_ON_HOST=/home/config.yaml
export CA_RAG_CONFIG_FILE_IN_CONTAINER=/config.yaml
export NVIDIA_API_KEY=nvapi-***
export BACKEND_PORT=31000
export FRONTEND_PORT=31009
export OPENAI_API_KEY=fake_key
export VLM_MODEL_TO_USE=openai-compat
export VIA_VLM_ENDPOINT=http://localhost:8000
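
If you do not yet have a CA-RAG config file at the path set above, the sketch below shows roughly how the summarization prompts (listed in full in the VLM Comparison section at the end of this post) fit into it. This is an assumption-laden illustration only: the authoritative schema is the config.yaml shipped with the VIA container, so copy that file and edit its prompt fields rather than relying on the nesting shown here.

# Hedged sketch only: field nesting is assumed, prompt text is truncated
cat > /home/config.yaml <<'EOF'
summarization:
  prompts:
    caption: "Summarize this traffic crossing video, focusing on unsafe or risk events ..."
    caption_summarization: "You should summarize the following events of a traffic crossing video ..."
    summary_aggregation: "You are an intelligent traffic monitoring system trying to detect vehicles ..."
EOF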

Then launch VIA:

docker run --rm -it --ipc=host --ulimit memlock=-1 \
 --ulimit stack=67108864 --tmpfs /tmp:exec --name via-server --net="host" \
 --gpus '"device=all"' \
 -p $FRONTEND_PORT:$FRONTEND_PORT \
 -p $BACKEND_PORT:$BACKEND_PORT \
 -e BACKEND_PORT=$BACKEND_PORT \
 -e FRONTEND_PORT=$FRONTEND_PORT \
 -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
 -e OPENAI_API_KEY=$OPENAI_API_KEY \
 -e VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=$VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME \
 -e VLM_MODEL_TO_USE=$VLM_MODEL_TO_USE \
 -e VIA_VLM_ENDPOINT=$VIA_VLM_ENDPOINT \
 -e CA_RAG_CONFIG=$CA_RAG_CONFIG_FILE_IN_CONTAINER \
 -e VIA_DEV_API=1 \
 -v via-hf-cache:/tmp/huggingface \
 -v $CA_RAG_CONFIG_FILE_ON_HOST:$CA_RAG_CONFIG_FILE_IN_CONTAINER \
 nvcr.io/metropolis/via-dp/via-engine:2.0-dp

Note that --net="host" is added so the VIA container can access the locally deployed VILA server.

Once VIA is fully loaded, you can access the VIA WebUI and summarize videos with a locally deployed VILA model!
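
For a quick check from the command line (hedged: the readiness route below is an assumption; consult the VIA documentation if it differs on your version):

# Assumed readiness probe and WebUI URL
curl -s "http://localhost:${BACKEND_PORT}/health/ready"
xdg-open "http://localhost:${FRONTEND_PORT}"    # or open this URL in a browser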

Notes

Memory Usage

The peak memory usage was measured while summarizing a 2-minute video with VIA and locally deployed VILA models on an A6000 GPU (48 GB VRAM). Use these measurements to help decide which VILA model to use based on your system's VRAM.

Deployment            Peak VRAM Usage
VIA + VILA1.5-3B      12.6 GB
VIA + VILA1.5-8B      22.5 GB
VIA + VILA1.5-13B     37.85 GB
VIA + VILA1.5-40B     OOM (out of memory on 48 GB)
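
To gauge peak usage on your own system, one option is to sample GPU memory while a summarization request is running (standard nvidia-smi query flags):

# Log GPU memory usage once per second while a summarization runs
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > vram_log.csv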

VLM Comparison

To get an idea of how summarization quality is affected by the choice of VLM, the summarization results from GPT-4o, VITA-2.0, VILA1.5-3B, VILA1.5-8B, and VILA1.5-13B are shown below.

Each model was used to summarize the same traffic video with the same input prompts. The notable events in the input video are a car driving in the wrong direction and a woman stepping out of her car into traffic.

Prompts:
caption: “Summarize this traffic crossing video, focusing on unsafe or risk events such as vehicles going in the wrong direction which could cause an accident.”

caption_summarization: “You should summarize the following events of a traffic crossing video in the format start_time:end_time:caption. If during a time segment only regular activities happen, then ignore them, else note any irregular activities like vehicles going in the wrong opposite direction in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don’t return anything else except the bullet points.”

summary_aggregation: “You are an intelligent traffic monitoring system trying to detect vehicles that are going in the wrong direction. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. The output should only contain bullet points.”

Results:

GPT-4o

Here are the aggregated captions in the format start_time:end_time:event_description:
• 15.0:16.0: Person exits a car in the middle of traffic.
• 17.099:19.099: Person walks away from the car, standing in front of it, posing a hazard to oncoming traffic.
• 20.099:39.099: Person stands in front of a white car in the left lane, walks towards the left side of the road, and interacts with the driver, posing a risk to traffic flow.
• 40.099:45.103: Person walks in the middle of the road between lanes of traffic, posing a potential hazard to themselves and the vehicles.

VITA-2.0

Here are the aggregated captions in the format start_time:end_time:event_description:
• 0.1:2.7:Wrong-way vehicle detected
• 2.8:9.8:Multiple wrong-way vehicles detected
• 10.1:12.5:Wrong-way vehicle detected in traffic
• 30.2:32.8:Wrong-way vehicle and large truck detected
• 32.8:40.0:Accident involving wrong-way vehicle and large truck

VILA1.5-3B

• 10.0:10.0: Vehicle traveling in wrong direction

VILA1.5-8B

Here are the aggregated captions in the format start_time:end_time:event_description:
• 0.00:10.00: Vehicles going in the wrong direction.
• 10.00:30.00: Vehicles going in the wrong direction, risking an accident.
• 40.00:45.60: Vehicles going in the wrong direction, person standing in the middle of the road, dangerous situation.

VILA1.5-13B

Here are the aggregated captions in the format start_time:end_time:event_description:
• 10.00:40.00: Vehicle traveling in wrong direction
• 50.00:70.80: Vehicle traveling in wrong direction, almost causing accidents

In general, the larger the model, the higher the quality of the output. VILA1.5-13B is a good alternative to VITA-2.0 and GPT-4o. VILA1.5-40B is also a great option for high-quality output, but it requires significant compute. The 3B and 8B models may hallucinate or miss some details in the video.
