VSS Docker Compose Deployment Issue with the VILA 1.5 40B VLM Model - Out-of-Memory Error on EC2 g6e.48xlarge

Hardware Platform: 8x NVIDIA L40S GPUs (EC2 g6e.48xlarge instance)

System Memory: 1536 GB RAM (EC2 g6e.48xlarge specification)

Ubuntu Version: Ubuntu 24.04 (Deep Learning Base OSS NVIDIA Driver GPU AMI)

NVIDIA GPU Driver Version: 570.172.08; CUDA Version: 12.8

Issue Type: Bugs

How to reproduce the issue: I’m encountering out-of-memory errors when attempting to deploy the VILA 1.5 40B VLM model using the VSS Docker Compose deployment on the recommended hardware configuration. The deployment fails during model loading/initialization.

Command used:

docker-compose up
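
If your Docker installation ships only the Compose v2 plugin (no standalone docker-compose binary), the equivalent invocation, using the attached compose.yaml, would be:

docker compose -f compose.yaml up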

Docker Compose File:

compose.yaml_nvidiaForums.txt (3.3 KB)

Error Logs:

VSS_40bModel_ErrorLogs.txt (81.8 KB)

Additional Context:

Following the official deployment documentation: VSS - Deploy Using Docker Compose — Video Search and Summarization Agent

Using the exact recommended EC2 instance type (g6e.48xlarge) with 8x L40S GPUs

AMI: Deep Learning Base OSS NVIDIA Driver GPU AMI (Ubuntu 24.04)

The error occurs consistently during model deployment

Questions:

Is there a known workaround for this out-of-memory issue on the recommended hardware?

Are there specific Docker Compose configuration parameters that need adjustment for the 40B model?

Should we consider different memory optimization strategies or model sharding configurations?

I have detailed logs and system configuration documentation available. Please let me know if you need any additional information to help diagnose and resolve this deployment issue.

The attached files contain:

Complete docker-compose.yml file

Full error logs and system configuration document

Output of nvidia-smi and system resource information

Have you modified the “export NVIDIA_VISIBLE_DEVICES=0,1,2” in the .env file?

Yes, I kept it as 0, and as part of testing I also tried different numbers of GPUs like 0,1,2,3,4,…

export NVIDIA_VISIBLE_DEVICES=0,1,2 means that you are using GPU 0, GPU 1, and GPU 2. If you are using L40S GPUs, you can try setting the parameters in the .env file like below.

#Set VLM to NVILA
export VLM_MODEL_TO_USE=nvila
export MODEL_PATH=ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita

#Adjust misc configs if needed
export DISABLE_GUARDRAILS=false
export NVIDIA_VISIBLE_DEVICES=0,1,2 #For L40S Deployment

.env_issue_topic.txt (1.2 KB)

We have used the NGC key and tried different combinations, but we are still unable to deploy the 40B model. We did deploy the 15B model successfully earlier. We also tried the 35B NIM container model earlier and saw a performance gap in the VILA API responses provided by NVIDIA, so we wanted to try the 40B model that ships as part of the VSS blueprint.

Or is there any other way to access this 40B model?

We have also used 8x L40S-based instances but saw OOM issues with both the Helm chart and Docker Compose deployment methods. I shared the error logs and system details for those earlier.

Could you attach your deployment topology? Theoretically, three L40S GPUs should be sufficient for the VLM. If you have deployed the LLM on GPUs 0,1,2,3, you can set NVIDIA_VISIBLE_DEVICES=4,5,6.
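
As a rough back-of-envelope check (approximate numbers, not from the VSS docs): a 40B-parameter model in FP16 needs about 40e9 params × 2 bytes ≈ 80 GB for the weights alone, plus KV cache and activation memory, while a single L40S has 48 GB. The engine therefore has to be sharded (e.g. tensor-parallel) across at least two to three GPUs to load at all; if the container ends up pinned to a single visible GPU, an OOM during loading is expected. A sketch of the .env change for the split above, assuming the LLM occupies GPUs 0-3:

export NVIDIA_VISIBLE_DEVICES=4,5,6 #VLM on GPUs 4-6; GPUs 0-3 stay with the LLM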

Hi @yuweiw,

Thanks for the suggestion. I need to clarify our deployment approach and share additional details:

Our Current Setup:

  • We’re using a hybrid approach with the Docker Compose method
  • LLM & Embedding: Using API calls (not local models), so no GPU allocation is needed for these
  • VLM only: Trying to deploy the 40B model locally using the available GPUs
  • In our .env file, we only see one NVIDIA_VISIBLE_DEVICES parameter since we’re not running the LLM locally

Current .env configuration:

  • Using the .env file I tagged earlier in the conversation
  • All GPUs should be available for VLM since LLM/embedding are via API calls

Persistent Issues:

  • Still encountering OOM errors even with the 8x L40S setup (see the quick GPU-memory check after this list)
  • Tried both helm chart and docker compose deployment methods
  • 15B model works fine, but 40B model consistently fails
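
One quick, generic check (not VSS-specific) while the 40B engine is loading is to watch per-GPU memory, to see whether the model is actually sharding across all eight GPUs or piling onto one:

watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv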

Additional Files:
I’m attaching:

  1. overrides.yaml file with our current configuration, plus error logs showing the OOM issues we hit with the Helm chart deployment.
  2. config.yaml used for the Docker Compose method.

Could you please review these files and suggest if there are any memory optimization parameters or configuration modifications we can make to successfully deploy the 40B model?

Is there something specific in our hybrid approach that might be causing these memory issues?

Thanks for your continued support!

helm chart deployment process.docx (286.5 KB)

config.yaml_docker compose method.txt (3.1 KB)

You can try setting VLM_BATCH_SIZE to 1 in your override file (see the sketch below).
Additionally, I did not see any error messages in your Helm chart file.
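
For illustration, a sketch of what that override might look like, assuming your overrides.yaml follows the env-list pattern used by the VSS Helm chart (cross-check the key path against your existing file, as it can differ between chart versions):

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_BATCH_SIZE
            value: "1"

For the Docker Compose method, the equivalent would be adding export VLM_BATCH_SIZE=1 to the .env file.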

Also, could you try using the engine file directly? You can refer to our vss-configuration-vila-engine-ngc-resource to learn how to use that.
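
For reference, a sketch of the .env changes that approach might involve, assuming MODEL_PATH accepts a local directory containing the prebuilt engine (the exact NGC resource name and any additional variables are in the linked doc; the path below is a placeholder):

export VLM_MODEL_TO_USE=vila-1.5
export MODEL_PATH=/path/to/downloaded/vila-engine #local engine directory instead of an ngc: URI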