Issue Category: Model Performance & Configuration
Detailed Description:
Current Setup
- Infrastructure: GPU-supported EC2 instances
- Implementation: FastAPI wrapper on top of the VILA inference command
- Problem: Significant performance gap compared to NVIDIA VILA API responses
Specific Issues
- Performance Discrepancy:
- Self-deployed VILA models showing inferior results compared to NVIDIA VILA API
- Suspect the NVIDIA API may be using larger or more extensively trained models (potentially >40B parameters)
- Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron family of vision language models
- Missing Parameter Configuration:
- Unable to configure inference parameters in current FastAPI implementation
- Need to pass: temperature, top_p, seed values to deployed model
- Current setup doesn’t support these sampling parameters
Questions for Support Team
- Model Specifications:
- What are the exact model parameters/versions used in NVIDIA VILA API?
- Are there larger parameter models (>40B) available that aren’t in public repositories?
- Will there be any difference in output quality if the self-hosted API is built directly on EC2, without using NVIDIA's Metropolis services?
- Parameter Configuration:
- How to properly implement temperature, top_p, and seed parameters in VILA inference?
- Best practices for FastAPI wrapper configuration with these parameters
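Pending guidance from the support team, a plausible mapping of these parameters is sketched below, assuming the self-hosted model is driven through a Hugging Face transformers-style `generate()` call (VILA's reference inference code follows this pattern); the function names and defaults are illustrative, not VILA's exact API.

```python
import random

import torch

def set_seed(seed: int) -> None:
    # Seeding the Python and torch RNGs makes sampled output reproducible
    # on a fixed setup (same hardware, same library versions).
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable

def sample_with_params(model, tokenizer, prompt: str,
                       temperature: float = 0.2, top_p: float = 0.7,
                       seed=None, max_new_tokens: int = 256) -> str:
    """Illustrative sketch: forward temperature/top_p/seed to generate()."""
    if seed is not None:
        set_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=temperature > 0,  # temperature/top_p only apply when sampling
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Note that `temperature` and `top_p` have no effect unless `do_sample=True`, and seeding only guarantees reproducibility on an identical software/hardware configuration.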
Expected Resolution
- Clear documentation on implementing sampling parameters (temperature, top_p, seed)
- Guidance on model selection to match NVIDIA API performance
- Best practices for FastAPI deployment with proper parameter support