"could not process config file: ..." error while trying to deploy VSS on AWS cloud

Hi,

I followed the instruction regarding deploying VSS on AWS cloud nodes and came up with this config.yml file:

schema_version: '0.0.9'
name: "via-aws-cns-{{ lookup('env', 'VIA_DEPLOY_ENV') }}"
spec:
  infra:
    csp: 'aws'
    backend:
      access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
      secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
      dynamodb_table: "{{ lookup('env', 'VIA_DEPLOY_AWS_DYT') }}"
      bucket: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3B') }}"
      region: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3BR') }}"
      encrypt: true
    provider:
      access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
      secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
    configs:
      cns:
        version: 12.2
        git_ref: 4d97cb7e8ca6e45fe9252888b7a918b2677f1fc9
        override_values:
          cns_nvidia_driver: yes
          gpu_driver_version: '535.216.03'
      access_cidrs:
      - '99.79.65.21/32'
      region: 'ca-central-1'
      ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
      ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
      additional_ssh_public_keys:
      - "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/my-colleague-1.pub') }}"
      clusters:
        app:
          private_instance: false
          master:
            type: 'p5.48xlarge'
            az: 'ca-central-1c'
            labels: {}
            taints: []
            capacity_reservation_id: 'cr-3b7e4c9f1a6d8e2b'
#          nodes:
#            A100:
#              type: 'p4d.24xlarge'
#              az: 'ca-central-1'
#              labels: {}
#              taints: []
#              capacity_reservation_id: 'cr-foobar'
#            L40S:
#              type: 'g6e.48xlarge'
#              az: 'ca-central-1'
#              labels: {}
#              taints: []
          ports:
            backend:
              port: 30081
            frontend:
              port: 30082
          features:
            cns: true
            platform: true
            app: true
  platform:
    configs:
      namespace: 'default'
  app:
    configs:
      namespace: 'default'
      backend_port: 'backend'
      frontend_port: 'frontend'
      ngc_api_key: "{{ lookup('env', 'NGC_API_KEY') }}"
      openai_api_key: "{{ lookup('env', 'OPENAI_API_KEY') }}"
      db_username: 'neo4j'
      db_password: "{{ lookup('env', 'VIA_DB_PASSWORD') | default('password') }}"
      vss_chart:
        repo:
          name: 'nvidia-blueprint'
          url: 'https://helm.ngc.nvidia.com/nvidia/blueprint'
        chart: 'nvidia-blueprint-vss'
        version: '2.1.0'
#        override_values_file_absolute_path: '/home/nvidia/aws/dist/override.yaml'

All env variables are set properly, but I couldn't find the override.yml file to uncomment the related line, so I kept it commented. I get this error every time I try to use the config to deploy VSS, and I remove the tmp folder after each attempt.

You mean you can’t find this overrides.yaml file?

Thanks for your response.

Since I was following the AWS deploy instructions, I didn’t notice the overrides.yaml file.

I added this overrides file, but got the same error while trying to deploy the config. I kept the overrides file unchanged from what was on the link you provided.

This is the config file:

    configs:
      cns:
        version: 12.2
        git_ref: 4d97cb7e8ca6e45fe9252888b7a918b2677f1fc9
        override_values:
          cns_nvidia_driver: yes
          gpu_driver_version: '535.216.03'
      access_cidrs:
      - '99.79.65.21/32'
      region: 'ca-central-1'
      ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
      ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
      additional_ssh_public_keys:
      - "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/my-colleague-1.pub') }}"
      clusters:
        app:
          private_instance: false
          master:
            type: 'p5.48xlarge'
            az: 'ca-central-1c'
            labels: {}
            taints: []
            capacity_reservation_id: 'cr-3b7e4c9f1a6d8e2b'
          ports:
            backend:
              port: 30081
            frontend:
              port: 30082
          features:
            cns: true
            platform: true
            app: true
  platform:
    configs:
      namespace: 'default'
  app:
    configs:
      namespace: 'default'
      backend_port: 'backend'
      frontend_port: 'frontend'
      ngc_api_key: "{{ lookup('env', 'NGC_API_KEY') }}"
      openai_api_key: "{{ lookup('env', 'OPENAI_API_KEY') }}"
      db_username: 'neo4j'
      db_password: "{{ lookup('env', 'VIA_DB_PASSWORD') | default('password') }}"
      vss_chart:
        repo:
          name: 'nvidia-blueprint'
          url: 'https://helm.ngc.nvidia.com/nvidia/blueprint'
        chart: 'nvidia-blueprint-vss'
        version: '2.1.0'
        override_values_file_absolute_path: '/home/ubuntu/dist/overrides.yaml'

and this is the overrides.yaml file:

            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom"
            # Specify path in case of VILA-1.5 and custom model. Can be either
            # a NGC resource path or a local path. For custom models this
            # must be a path to the directory containing "inference.py" and
            # "manifest.yaml" files.
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            - name: DISABLE_GUARDRAILS
              value: "false" # "true" to disable guardrails.
            - name: TRT_LLM_MODE
              value: ""  # int4_awq (default), int8 or fp16. (for VILA only)
            - name: VLM_BATCH_SIZE
              value: ""  # Default is determined based on GPU memory. (for VILA only)
            - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
              value: ""  # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            - name: VIA_VLM_ENDPOINT
              value: ""  # Default OpenAI API. Override to use a custom API
            - name: VIA_VLM_API_KEY
              value: ""  # API key to set when calling VIA_VLM_ENDPOINT
            - name: OPENAI_API_VERSION
              value: ""
            - name: AZURE_OPENAI_API_VERSION
              value: ""

  resources:
    limits:
      nvidia.com/gpu: 2   # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>

nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1  # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1  # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

Hi @korosh.roohi9731 , could you run printenv to verify that all the required environment variables are properly set?

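If the printenv output is long, a short loop can flag any missing variable directly. This is just a sketch (it assumes bash, for the indirect `${!v}` expansion), with the variable list read off the `lookup('env', ...)` calls in your config:

```shell
# Sketch (bash): report any unset environment variable referenced by config.yml.
check_env() {
  local missing=0 v
  for v in "$@"; do
    if [ -z "${!v}" ]; then   # indirect expansion: the value of the variable named $v
      echo "MISSING: $v"
      missing=1
    fi
  done
  return "$missing"
}

check_env VIA_DEPLOY_ENV AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
          VIA_DEPLOY_AWS_DYT VIA_DEPLOY_AWS_S3B VIA_DEPLOY_AWS_S3BR \
          NGC_API_KEY OPENAI_API_KEY \
  && echo "all referenced variables are set" \
  || echo "export the variables above, then re-run the deploy"
```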
Can you also check whether the following ssh-key is generated normally?

      ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
      ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
      additional_ssh_public_keys:
      - "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/my-colleague-1.pub') }}"
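A quick sketch to check that (assuming the default ~/.ssh/id_rsa paths from the config): confirm both halves of the key pair exist and that their fingerprints match.

```shell
# Sketch: verify the key pair referenced in config.yml exists and matches.
key="$HOME/.ssh/id_rsa"
if [ ! -f "$key" ]; then
  echo "private key missing: $key (generate with: ssh-keygen -t rsa -f \"$key\")"
elif [ ! -f "$key.pub" ]; then
  echo "public key missing: $key.pub"
else
  # The two fingerprints printed below should be identical:
  ssh-keygen -lf "$key"
  ssh-keygen -lf "$key.pub"
fi
```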

Thanks for your message. The problem processing the config file was caused by one of the environment variables, which is fixed now, but I ran into another problem with this log:

preparing artifacts
applying TF shape
CTRL-C to abort
╷
│ Error: error configuring S3 Backend: error validating provider credentials: error calling sts:GetCallerIdentity: InvalidClientTokenId: The security token included in the request is invalid.
│       status code: 403, request id: b9b523ef-4809-471a-9e82-c43aaab5c08f
│
│
╵

╷
│ Error: Backend initialization required, please run "terraform init"
│
│ Reason: Initial configuration of the requested backend "s3"
│
│ The "backend" is the interface that Terraform uses to store state,
│ perform operations, etc. If this message is showing up, it means that the
│ Terraform configuration you're using is using a custom configuration for
│ the Terraform backend.
│
│ Changes to backend configurations require reinitialization. This allows
│ Terraform to set up the new configuration, copy existing state, etc. Please run
│ "terraform init" with either the "-reconfigure" or "-migrate-state" flags to
│ use the current configuration.
│
│ If the change reason above is incorrect, please verify your configuration
│ hasn't changed and try again. At this point, no changes to your existing
│ configuration or state have been made.
╵
failed to determine IaC changes

This looks like an AWS permissions issue. Could you verify your AWS access credentials? You can also check that the IAM user or role associated with the provided credentials has the necessary permissions to call sts:GetCallerIdentity and to access the S3 bucket.
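One way to reproduce the check outside Terraform (a sketch, assuming the AWS CLI is installed) is to call sts:GetCallerIdentity directly with the same exported credentials; once they are fixed, the second error clears by re-initializing the backend:

```shell
# Sketch: reproduce the Terraform credential check with the same env vars.
echo "using access key id: ${AWS_ACCESS_KEY_ID:0:4}****"
if command -v aws >/dev/null 2>&1; then
  # A 403 InvalidClientTokenId here means the exported credentials are stale or invalid.
  aws sts get-caller-identity || echo "credentials rejected; re-export them and retry"
  # Confirm the state bucket named in the config is reachable:
  aws s3 ls "s3://${VIA_DEPLOY_AWS_S3B}" || echo "cannot list the state bucket"
else
  echo "aws CLI not installed"
fi
# After fixing the credentials, re-initialize from the generated Terraform directory:
#   terraform init -reconfigure
```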

Which environment variable was causing the issue? I am getting the same error.

It was basically all of the environment variables. I had reset the host server without re-running the “source via…” command.
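For anyone hitting the same thing: exported variables vanish when the host resets, so they have to be re-loaded in every new session. A sketch of the pattern (the file name and values here are placeholders, not the blueprint's actual script):

```shell
# Sketch: keep the deploy variables in one file and source it in each new shell.
# (Placeholder file and values for illustration only.)
cat > /tmp/via-env-example.sh <<'EOF'
export VIA_DEPLOY_ENV=dev
export AWS_ACCESS_KEY_ID=REPLACE_ME
export AWS_SECRET_ACCESS_KEY=REPLACE_ME
EOF

# Run this in every new session (or add it to ~/.bashrc):
. /tmp/via-env-example.sh
echo "VIA_DEPLOY_ENV=$VIA_DEPLOY_ENV"
```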


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.