All env variables are set properly, but I couldn't find the overrides.yaml file to uncomment the related line, so I left it commented. I get this error every time I try to deploy the VSS with the config, and I remove the tmp folder after each attempt.
Since I was following the AWS deploy instructions, I didn’t notice the overrides.yaml file.
I added this overrides file, but got the same error while trying to deploy the config. I kept the overrides file unchanged from the version at the link you provided.
- name: VLM_MODEL_TO_USE
  value: vila-1.5 # Or "openai-compat" or "custom"
# Specify path in case of VILA-1.5 and custom model. Can be either
# a NGC resource path or a local path. For custom models this
# must be a path to the directory containing "inference.py" and
# "manifest.yaml" files.
- name: MODEL_PATH
  value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
- name: DISABLE_GUARDRAILS
  value: "false" # "true" to disable guardrails.
- name: TRT_LLM_MODE
  value: "" # int4_awq (default), int8 or fp16. (for VILA only)
- name: VLM_BATCH_SIZE
  value: "" # Default is determined based on GPU memory. (for VILA only)
- name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
  value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
- name: VIA_VLM_ENDPOINT
  value: "" # Default OpenAI API. Override to use a custom API
- name: VIA_VLM_API_KEY
  value: "" # API key to set when calling VIA_VLM_ENDPOINT
- name: OPENAI_API_VERSION
  value: ""
- name: AZURE_OPENAI_API_VERSION
  value: ""

resources:
  limits:
    nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
# nodeSelector:
#   kubernetes.io/hostname: <node-1>

nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
Thanks for your message. The problem with processing the config file was caused by one of the environment variables, which is fixed now, but I ran into another problem with this log:
preparing artifacts
applying TF shape
CTRL-C to abort
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: InvalidClientTokenId: The security token included in the request is invalid.
│ status code: 403, request id: b9b523ef-4809-471a-9e82-c43aaab5c08f
│
│
╵
╷
│ Error: Backend initialization required, please run "terraform init"
│
│ Reason: Initial configuration of the requested backend "s3"
│
│ The "backend" is the interface that Terraform uses to store state,
│ perform operations, etc. If this message is showing up, it means that the
│ Terraform configuration you're using is using a custom configuration for
│ the Terraform backend.
│
│ Changes to backend configurations require reinitialization. This allows
│ Terraform to set up the new configuration, copy existing state, etc. Please run
│ "terraform init" with either the "-reconfigure" or "-migrate-state" flags to
│ use the current configuration.
│
│ If the change reason above is incorrect, please verify your configuration
│ hasn't changed and try again. At this point, no changes to your existing
│ configuration or state have been made.
╵
failed to determine IaC changes
This looks like an AWS credentials or permissions issue. Could you verify your AWS access credentials? Also check that the IAM user or role associated with those credentials has the necessary permissions to call sts:GetCallerIdentity and to access the S3 bucket used for the Terraform state backend.
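If it helps, here is a quick way to confirm which identity those credentials resolve to, assuming the AWS CLI is installed and the deploy runs in the same shell (the exact export lines below depend on how your credentials are provided):

# Confirm which identity the current credentials resolve to (the same STS call Terraform makes)
aws sts get-caller-identity

# Show where the CLI is picking up credentials from (env vars, shared profile, etc.)
aws configure list

# If the keys were rotated or a session token expired, re-export them before retrying, e.g.:
# export AWS_ACCESS_KEY_ID=...
# export AWS_SECRET_ACCESS_KEY=...
# export AWS_SESSION_TOKEN=...   # only needed for temporary credentials

# Once sts:GetCallerIdentity succeeds, reinitialize the S3 backend as the log suggests:
terraform init -reconfigure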