Performance Gap Between Self-Hosted VILA Model and NVIDIA VILA API - Need Parameter Configuration Guidance

Issue Category: Model Performance & Configuration

Detailed Description:

Current Setup

  • Infrastructure: GPU-supported EC2 instances
  • Implementation: FastAPI wrapper on top of VILA inference command
  • Problem: Significant performance gap compared to NVIDIA VILA API responses

Specific Issues

  1. Performance Discrepancy:
  • Self-deployed VILA models showing inferior results compared to NVIDIA VILA API
  • Suspect the NVIDIA API may be using larger or more extensively trained models (potentially >40B parameters)
  • Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron vision language model family
  2. Missing Parameter Configuration:
  • Unable to configure inference parameters in the current FastAPI implementation
  • Need to pass temperature, top_p, and seed values to the deployed model
  • Current setup doesn’t support these sampling parameters (see the sketch after this list)
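
To make the second point concrete, below is a minimal sketch of a wrapper that accepts temperature, top_p, and seed and forwards them to the VILA inference command. The script name and CLI flags used here (run_vila_inference.py, --query, --image-file, --temperature, --top-p, --seed) are assumptions for illustration only and need to be mapped onto whatever arguments the VILA inference script in your checkout actually exposes.

```python
# Sketch only: a FastAPI wrapper that forwards sampling parameters to a
# VILA inference command via subprocess. All flag and script names below
# are placeholders and must match the actual CLI of your VILA checkout.
import subprocess
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class InferenceRequest(BaseModel):
    prompt: str
    image_path: str
    temperature: float = Field(default=0.2, ge=0.0, le=2.0)
    top_p: float = Field(default=0.9, ge=0.0, le=1.0)
    seed: Optional[int] = None


@app.post("/infer")
def infer(req: InferenceRequest) -> dict:
    cmd = [
        "python", "run_vila_inference.py",   # hypothetical script name
        "--query", req.prompt,
        "--image-file", req.image_path,
        "--temperature", str(req.temperature),
        "--top-p", str(req.top_p),
    ]
    if req.seed is not None:
        cmd += ["--seed", str(req.seed)]      # assumes the script exposes a seed flag

    # Run the inference command and return its stdout as the response.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return {"output": result.stdout.strip()}
```

Spawning a subprocess per request reloads the model every time and is slow; in practice the model would be loaded once at startup and kept in memory, but the parameter plumbing stays the same.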

Questions for Support Team

  1. Model Specifications:
  • What are the exact model parameters/versions used in NVIDIA VILA API?
  • Are there larger parameter models (>40B) available that aren’t in public repositories?
  • Would there be any difference if the self-hosted API were built directly on EC2, without using the NVIDIA-provided Metropolis services?
  2. Parameter Configuration:
  • How to properly implement temperature, top_p, and seed parameters in VILA inference? (see the sketch after this list)
  • Best practices for FastAPI wrapper configuration with these parameters
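
As background for the parameter question above, this is roughly how temperature, top_p, and a seed are usually wired into a HuggingFace-style generate() call, which is what LLaVA-derived codebases such as VILA build on. It is a sketch under that assumption, not VILA's actual inference code.

```python
# Sketch: seeding and sampling kwargs for a transformers-style generate() call.
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Seed every RNG that can influence sampling so that repeated calls
    # with the same seed produce the same output.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def sample(model, input_ids, temperature: float, top_p: float, seed: int):
    set_seed(seed)
    # temperature and top_p only take effect when do_sample=True; with
    # greedy decoding (do_sample=False) they are ignored by generate().
    return model.generate(
        input_ids,
        do_sample=temperature > 0,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=512,
    )
```

Note that exact determinism across runs also depends on deterministic CUDA kernels, batch size, and hardware, so a fixed seed narrows but does not always eliminate run-to-run variation.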

Expected Resolution

  • Clear documentation on implementing sampling parameters (temperature, top_p, seed)
  • Guidance on model selection to match NVIDIA API performance
  • Best practices for FastAPI deployment with proper parameter support

Could you please attach the page about “NVIDIA VILA API” you referred to? This forum mainly focuses on Video Search and Summarization Agent topics.

I used the above as a reference and ran inference on my EC2 instance, but the results are simply not accurate, whereas the VILA API provided by NVIDIA (VILA Model by NVIDIA | NVIDIA NIM) clearly outperforms it. What should we do to get the same kind of responses as the VILA API, given that we have not found NIM deployment resources for the Cosmos Nemotron 34B model?