Nvidia ACE Examples

Hello,

I am having some trouble getting the ACE agent sample bots working. I have tried many of them, but get stuck at the same point. For this one, I’d like to walk through the Spanish weather bot.

PRE-REQUISITES

  1. NVIDIA Riva AWS AMI: g5.2xlarge, aws-marketplace/NVIDIA GPU Cloud VMI RIVA 2024.05.1 x86_64-prod-qkv7bhoohtguy
  1. Hardware - GPU (A10G, 32GB RAM, 15GB swap, 256GB EBS storage)
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        248G  139G  110G  56% /
devtmpfs          16G     0   16G   0% /dev
tmpfs             16G  2.1M   16G   1% /dev/shm
tmpfs            3.2G  1.3M  3.1G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs             16G     0   16G   0% /sys/fs/cgroup
/dev/loop0        26M   26M     0 100% /snap/amazon-ssm-agent/5656
/dev/nvme0n1p15  105M  5.2M  100M   5% /boot/efi
/dev/loop1        26M   26M     0 100% /snap/amazon-ssm-agent/7993
/dev/loop2        68M   68M     0 100% /snap/lxd/22753
/dev/loop3        56M   56M     0 100% /snap/core18/2829
/dev/loop4        92M   92M     0 100% /snap/lxd/29619
/dev/loop5        39M   39M     0 100% /snap/snapd/21759
/dev/loop6        64M   64M     0 100% /snap/core20/2318
/dev/loop8        62M   62M     0 100% /snap/core20/1587
/dev/loop7        56M   56M     0 100% /snap/core18/2538
tmpfs            3.2G  4.0K  3.2G   1% /run/user/1000

$ free -g
              total        used        free      shared  buff/cache   available
Mem:             31           2           7           0          21          28
Swap:            15           0          15
  1. Development Setup
# correct driver
$ nvidia-smi
Mon Aug 12 13:14:01 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   32C    P0             61W /  300W |    8500MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     13363      C   tritonserver                                 8492MiB |
+-----------------------------------------------------------------------------------------+

# docker login
$ docker login nvcr.io -u \$oauthtoken
Password: 
WARNING! Your password will be stored unencrypted in /home/ubuntu/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credential-stores

Login Succeeded
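Side note on the warning above: it can be silenced with a credential helper. A minimal sketch, assuming `docker-credential-pass` is installed and on PATH, and that `~/.docker/config.json` holds nothing else you need to keep (this overwrites the file):

```shell
# Sketch: point Docker at the "pass" credential helper so the NGC token is
# no longer stored in plain text in config.json.
mkdir -p "$HOME/.docker"
cat > "$HOME/.docker/config.json" <<'EOF'
{
  "auths": { "nvcr.io": {} },
  "credsStore": "pass"
}
EOF
```

Run `docker login nvcr.io` again afterwards so the token is stored through the helper instead of the file.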

# ngc
$ ngc config set

<<API key added and verified by pulling a simple resource>>

# ucf setup
$ ucf_app_builder_cli -h
usage: ucf_app_builder_cli [-h] [-v] [-va]  ...

positional arguments:
  
    app               Perform actions on apps
    service           Perform actions on a microservice
    registry          Perform actions on registry

options:
  -h, --help          show this help message and exit
  -v, --version       Print MS Builder Version
  -va, --version-all  Print all versions

How to reproduce the issue?

SET UP

  • Set my environment variables as instructed
ubuntu@ip-172-31-45-8:~$ export NGC_CLI_API_KEY=<<my NGC CLI key>>
ubuntu@ip-172-31-45-8:~$ export NVIDIA_API_KEY=<<my personal key>>
ubuntu@ip-172-31-45-8:~$ export WEATHERSTACK_API_KEY=<<my api key>>
ubuntu@ip-172-31-45-8:~$ export OPENAI_API_KEY=<<my api key>>
ubuntu@ip-172-31-45-8:~$ export BOT_PATH=./samples/spanish_bot_nmt
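Before starting the containers, it can help to fail fast if any of those variables is unset. A small preflight sketch (bash-specific; the variable names are the ones from my exports above):

```shell
# Preflight check: report any required environment variable that is unset
# or empty before running docker compose. Uses bash ${!v} indirection.
required_vars=(NGC_CLI_API_KEY NVIDIA_API_KEY WEATHERSTACK_API_KEY OPENAI_API_KEY BOT_PATH)
missing=()
for v in "${required_vars[@]}"; do
  if [ -z "${!v}" ]; then
    missing+=("$v")
  fi
done
if [ "${#missing[@]}" -gt 0 ]; then
  echo "Missing: ${missing[*]}"
else
  echo "All required variables are set"
fi
```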

ISSUE

  • None of the sample bots work; they all fail on the same command
$ docker compose -f deploy/docker/docker-compose.yml up model-utils
[+] Running 1/1
 ✔ Container model-utils  Recreated                                                                           0.2s 
Attaching to model-utils
model-utils  | 2024-08-12 13:25:04,507 [INFO] Stopping and Removing existing Riva Speech Server ...
model-utils  | 2024-08-12 13:25:05,635 [INFO] Stopping and Removing existing NLP Triton Server ...
model-utils  | Error response from daemon: No such container: nlp_triton
model-utils  | 2024-08-12 13:25:05,653 [INFO] Getting models from model config /home/ubuntu/ACE/microservices/ace_agent/samples/spanish_bot_nmt/model_config.yaml
model-utils  | 2024-08-12 13:25:05,654 [INFO] Skipping Speech models for deployment
model-utils  | 2024-08-12 13:25:05,655 [INFO] Downloading the NGC model nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0
model-utils  | 2024-08-12 13:25:08,995 [INFO] Found exisiting downloaded model for nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0
model-utils  | 2024-08-12 13:25:08,996 [INFO] Successfully downloaded the NGC model nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0 at /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_any_en_500m_2.15.0/rmir_megatronnmt_any_en_500m_v2.15.0
model-utils  | 2024-08-12 13:25:08,997 [INFO] Downloading the NGC model nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0
model-utils  | 2024-08-12 13:25:11,521 [INFO] Found exisiting downloaded model for nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0
model-utils  | 2024-08-12 13:25:11,521 [INFO] Successfully downloaded the NGC model nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0 at /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_en_any_500m_2.15.0/rmir_megatronnmt_en_any_500m_v2.15.0
model-utils  | 2024-08-12 13:25:11,528 [INFO] Using cached Triton Model plans for RMIR model /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_any_en_500m_2.15.0/rmir_megatronnmt_any_en_500m_v2.15.0/rmir_megatronnmt_any_en_500m.rmir
model-utils  | 2024-08-12 13:25:21,550 [INFO] Using cached Triton Model plans for RMIR model /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_en_any_500m_2.15.0/rmir_megatronnmt_en_any_500m_v2.15.0/rmir_megatronnmt_en_any_500m.rmir
model-utils  | 2024-08-12 13:25:24,307 [INFO] Deploying Riva Skills model repository /home/ubuntu/ACE/microservices/ace_agent/model_repository
model-utils  | 2024-08-12 13:25:24,308 [INFO] Starting TRITON & RIVA API server...
model-utils  | Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
model-utils  | Waiting for Riva server to load all models...retrying in 10 seconds.
model-utils  | Riva server is ready...
model-utils  | 2024-08-12 13:26:10,010 [INFO] Successfully deployed Riva Speech Server
model-utils  | 2024-08-12 13:26:10,016 [INFO] Riva Speech Server deployed models :
model-utils  | --------------------------------------------------------------------------------
model-utils  | | MODEL NAME                                                        | VERSION  |
model-utils  | --------------------------------------------------------------------------------
model-utils  | | megatronnmt_any_en_500m                                           |     1    |
model-utils  | | megatronnmt_any_en_500m-classifier                                |     1    |
model-utils  | | megatronnmt_any_en_500m-decoder                                   |     1    |
model-utils  | | megatronnmt_any_en_500m-encoder                                   |     1    |
model-utils  | | megatronnmt_en_any_500m                                           |     1    |
model-utils  | | megatronnmt_en_any_500m-classifier                                |     1    |
model-utils  | | megatronnmt_en_any_500m-decoder                                   |     1    |
model-utils  | | megatronnmt_en_any_500m-encoder                                   |     1    |
model-utils  | --------------------------------------------------------------------------------
model-utils  | 
model-utils  | 2024-08-12 13:26:10,016 [INFO] No models found for deployment with Triton Server
model-utils  | 2024-08-12 13:26:10,017 [WARNING] Triton Server is not up, unable to list the models.
model-utils exited with code 0
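Since the log ends with "Triton Server is not up", one thing I can check after model-utils exits is what is actually listening. A diagnostic sketch, assuming the stock port layout (Riva gRPC on 50051, Triton HTTP/gRPC/metrics on 8000-8002 — adjust if the compose file remaps them):

```shell
# Probe localhost ports using bash's /dev/tcp; no extra tools required.
check_port() {
  # Returns 0 if something is listening on 127.0.0.1:$1.
  (exec 3<> "/dev/tcp/127.0.0.1/$1") 2>/dev/null
}
for port in 50051 8000 8001 8002; do
  if check_port "$port"; then
    echo "port $port: open"
  else
    echo "port $port: closed"
  fi
done
```

Running `docker ps` alongside this shows whether the riva-speech container is still up after the model-utils container exits.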

I’ve looked over the docs many times and can’t figure out why this fails. Thanks in advance for the help.

Have you solved it?

Unfortunately, no.