Hello,
I'm having some trouble getting the ACE Agent sample bots working. I have tried many of them, but get stuck at the same point. For this post, I'd like to walk through the Spanish weather bot.
PRE-REQUISITES
- NVIDIA Riva AWS AMI: g5.2xlarge, aws-marketplace/NVIDIA GPU Cloud VMI RIVA 2024.05.1 x86_64-prod-qkv7bhoohtguy
- view here: https://aws.amazon.com/marketplace/pp/prodview-w5fa7gipuabbg
- To keep things simple, I have enabled ALL outbound ports
- I tested a simple Gradio web app to confirm I could reach the instance from my web browser without issues
- Hardware: A10G GPU (24GB VRAM), 32GB RAM, 15GB swap, 256GB EBS storage
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        248G  139G  110G  56% /
devtmpfs          16G     0   16G   0% /dev
tmpfs             16G  2.1M   16G   1% /dev/shm
tmpfs            3.2G  1.3M  3.1G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs             16G     0   16G   0% /sys/fs/cgroup
/dev/loop0        26M   26M     0 100% /snap/amazon-ssm-agent/5656
/dev/nvme0n1p15  105M  5.2M  100M   5% /boot/efi
/dev/loop1        26M   26M     0 100% /snap/amazon-ssm-agent/7993
/dev/loop2        68M   68M     0 100% /snap/lxd/22753
/dev/loop3        56M   56M     0 100% /snap/core18/2829
/dev/loop4        92M   92M     0 100% /snap/lxd/29619
/dev/loop5        39M   39M     0 100% /snap/snapd/21759
/dev/loop6        64M   64M     0 100% /snap/core20/2318
/dev/loop8        62M   62M     0 100% /snap/core20/1587
/dev/loop7        56M   56M     0 100% /snap/core18/2538
tmpfs            3.2G  4.0K  3.2G   1% /run/user/1000
$ free -g
              total        used        free      shared  buff/cache   available
Mem:             31           2           7           0          21          28
Swap:            15           0          15
- Development Set Up
# correct driver
$ nvidia-smi
Mon Aug 12 13:14:01 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 32C P0 61W / 300W | 8500MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 13363 C tritonserver 8492MiB |
+-----------------------------------------------------------------------------------------+
# docker login
$ docker login nvcr.io -u \$oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /home/ubuntu/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credential-stores
Login Succeeded
# ngc
$ ngc config set
<<added and verified with pulling a simple resource>>
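For reference, "verified" here just means a small test pull succeeded; it was roughly along these lines (the exact resource is only an example, any small download works):
# show the active NGC CLI configuration, then pull a small public resource
$ ngc config current
$ ngc registry resource download-version "nvidia/riva/riva_quickstart:2.15.0"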
# ucf setup
$ ucf_app_builder_cli -h
usage: ucf_app_builder_cli [-h] [-v] [-va] ...

positional arguments:
  app                  Perform actions on apps
  service              Perform actions on a microservice
  registry             Perform actions on registry

options:
  -h, --help           show this help message and exit
  -v, --version        Print MS Builder Version
  -va, --version-all   Print all versions
How to reproduce the issue?
SET UP
- Set my environment variables as instructed:
ubuntu@ip-172-31-45-8:~$ export NGC_CLI_API_KEY=<<my key cli key>>
ubuntu@ip-172-31-45-8:~$ export NVIDIA_API_KEY=<<my personal key>>
ubuntu@ip-172-31-45-8:~$ export WEATHERSTACK_API_KEY=<<my api key>>
ubuntu@ip-172-31-45-8:~$ export OPENAI_API_KEY=<<my api key>>
ubuntu@ip-172-31-45-8:~$ export BOT_PATH=./samples/spanish_bot_nmt
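Before bringing up model-utils, I double-check that these variables are actually exported in the shell that runs docker compose; a quick check along these lines (the sed just redacts the values):
# confirm the keys and BOT_PATH are visible in the current shell
$ env | grep -E 'NGC_CLI_API_KEY|NVIDIA_API_KEY|WEATHERSTACK_API_KEY|OPENAI_API_KEY|BOT_PATH' | sed 's/=.*/=<set>/'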
ISSUE
- None of the sample bots work; they all fail at the same command:
$ docker compose -f deploy/docker/docker-compose.yml up model-utils
[+] Running 1/1
✔ Container model-utils Recreated 0.2s
Attaching to model-utils
model-utils | 2024-08-12 13:25:04,507 [INFO] Stopping and Removing existing Riva Speech Server ...
model-utils | 2024-08-12 13:25:05,635 [INFO] Stopping and Removing existing NLP Triton Server ...
model-utils | Error response from daemon: No such container: nlp_triton
model-utils | 2024-08-12 13:25:05,653 [INFO] Getting models from model config /home/ubuntu/ACE/microservices/ace_agent/samples/spanish_bot_nmt/model_config.yaml
model-utils | 2024-08-12 13:25:05,654 [INFO] Skipping Speech models for deployment
model-utils | 2024-08-12 13:25:05,655 [INFO] Downloading the NGC model nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0
model-utils | 2024-08-12 13:25:08,995 [INFO] Found exisiting downloaded model for nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0
model-utils | 2024-08-12 13:25:08,996 [INFO] Successfully downloaded the NGC model nvidia/riva/rmir_megatronnmt_any_en_500m:2.15.0 at /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_any_en_500m_2.15.0/rmir_megatronnmt_any_en_500m_v2.15.0
model-utils | 2024-08-12 13:25:08,997 [INFO] Downloading the NGC model nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0
model-utils | 2024-08-12 13:25:11,521 [INFO] Found exisiting downloaded model for nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0
model-utils | 2024-08-12 13:25:11,521 [INFO] Successfully downloaded the NGC model nvidia/riva/rmir_megatronnmt_en_any_500m:2.15.0 at /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_en_any_500m_2.15.0/rmir_megatronnmt_en_any_500m_v2.15.0
model-utils | 2024-08-12 13:25:11,528 [INFO] Using cached Triton Model plans for RMIR model /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_any_en_500m_2.15.0/rmir_megatronnmt_any_en_500m_v2.15.0/rmir_megatronnmt_any_en_500m.rmir
model-utils | 2024-08-12 13:25:21,550 [INFO] Using cached Triton Model plans for RMIR model /home/ubuntu/ACE/microservices/ace_agent/.cache/bot_maker/nvidia_riva_rmir_megatronnmt_en_any_500m_2.15.0/rmir_megatronnmt_en_any_500m_v2.15.0/rmir_megatronnmt_en_any_500m.rmir
model-utils | 2024-08-12 13:25:24,307 [INFO] Deploying Riva Skills model repository /home/ubuntu/ACE/microservices/ace_agent/model_repository
model-utils | 2024-08-12 13:25:24,308 [INFO] Starting TRITON & RIVA API server..
model-utils | Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
model-utils | Waiting for Riva server to load all models...retrying in 10 seconds
model-utils | Riva server is ready...
model-utils | 2024-08-12 13:26:10,010 [INFO] Successfully deployed Riva Speech Server
model-utils | 2024-08-12 13:26:10,016 [INFO] Riva Speech Server deployed models :
model-utils | --------------------------------------------------------------------------------
model-utils | | MODEL NAME | VERSION |
model-utils | --------------------------------------------------------------------------------
model-utils | | megatronnmt_any_en_500m | 1 |
model-utils | | megatronnmt_any_en_500m-classifier | 1 |
model-utils | | megatronnmt_any_en_500m-decoder | 1 |
model-utils | | megatronnmt_any_en_500m-encoder | 1 |
model-utils | | megatronnmt_en_any_500m | 1 |
model-utils | | megatronnmt_en_any_500m-classifier | 1 |
model-utils | | megatronnmt_en_any_500m-decoder | 1 |
model-utils | | megatronnmt_en_any_500m-encoder | 1 |
model-utils | --------------------------------------------------------------------------------
model-utils |
model-utils | 2024-08-12 13:26:10,016 [INFO] No models found for deployment with Triton Server
model-utils | 2024-08-12 13:26:10,017 [WARNING] Triton Server is not up, unable to list the models.
model-utils exited with code 0
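For what it's worth, after model-utils exits I've been checking what is actually left running; this is only a rough sketch of those checks (the container name is a placeholder, and the last command assumes the Triton HTTP port 8000 is exposed):
# list running containers to see whether the Riva speech container stayed up
$ docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
# tail the logs of the Riva container reported by docker ps (name varies by setup)
$ docker logs --tail 50 <riva-container-name>
# if the Triton HTTP port is exposed, this returns 200 once models are ready
$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/v2/health/ready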
I've looked over the docs many times and can't figure out why this fails. Thanks in advance for the help.