Issue Category: Model Performance & Configuration
Detailed Description:
Current Setup
- Infrastructure: GPU-supported EC2 instances
- Implementation: FastAPI wrapper on top of the VILA inference command
- Problem: Significant performance gap compared to NVIDIA VILA API responses
Specific Issues
- Performance Discrepancy:
- Self-deployed VILA models showing inferior results compared to NVIDIA VILA API
- Suspect the NVIDIA API may be using larger or more extensively trained models (potentially >40B parameters)
- Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron family of vision language models
- Missing Parameter Configuration:
- Unable to configure inference parameters in current FastAPI implementation
- Need to pass: temperature, top_p, seed values to deployed model
- Current setup doesn’t support these sampling parameters
Questions for Support Team
- Model Specifications:
- What are the exact model parameters/versions used in NVIDIA VILA API?
- Are there larger parameter models (>40B) available that aren’t in public repositories?
- Will there be any difference in output quality if the self-hosted API is built directly on EC2, without using NVIDIA's Metropolis services?
- Parameter Configuration:
- How to properly implement temperature, top_p, and seed parameters in VILA inference?
- Best practices for FastAPI wrapper configuration with these parameters
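Pending guidance from the support team, a plausible mapping of these parameters is sketched below, assuming the self-hosted model is driven through a Hugging Face transformers-style `generate()` call (VILA's reference inference code follows this pattern); the function names and defaults are illustrative, not VILA's exact API.

```python
import random

import torch

def set_seed(seed: int) -> None:
    # Seeding the Python and torch RNGs makes sampled output reproducible
    # on a fixed setup (same hardware, same library versions).
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable

def sample_with_params(model, tokenizer, prompt: str,
                       temperature: float = 0.2, top_p: float = 0.7,
                       seed=None, max_new_tokens: int = 256) -> str:
    """Illustrative sketch: forward temperature/top_p/seed to generate()."""
    if seed is not None:
        set_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=temperature > 0,  # temperature/top_p only apply when sampling
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Note that `temperature` and `top_p` have no effect unless `do_sample=True`, and seeding only guarantees reproducibility on an identical software/hardware configuration.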
Expected Resolution
- Clear documentation on implementing sampling parameters (temperature, top_p, seed)
- Guidance on model selection to match NVIDIA API performance
- Best practices for FastAPI deployment with proper parameter support