Hi,
I’m currently using Kubernetes to run Clara Train SDK v2. I start the image with this command:
kubectl run -n syifa clara-v2 -it --tty --image=nvcr.io/nvidia/clara-train-sdk:v2.0 start_aas.sh
The terminal shows the following output:
If you don’t see a command prompt, try pressing enter.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
NOTE: Detected MOFED driver 4.4-2.0.7; version automatically updated.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TensorFlow. NVIDIA recommends the use of the following flags:
nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 …/opt/nvidia/medical/aiaa-launch-config.json
ENGINE:: engine=TRTIS
TRTIS:: Backend is enabled
TRTIS:: trtis_ip=localhost
TRTIS:: Will setup TRTIS Server on localhost
TRTIS:: trtis_http_port=8000
TRTIS:: trtis_grpc_port=8001
TRTIS:: trtis_metrics_port=8002
TRTIS:: trtis_proto=grpc
TRTIS:: trtis_model_path=/var/nvidia/aiaa/trtis_models
TRTIS:: trtis_verbose=false
TRTIS:: trtis_log=/var/nvidia/aiaa/logs/host-80/trtis.log
TRTIS:: trtis_start_timeout=120
TRTIS:: trtis_model_timeout=30
TRTIS:: Waiting 1 seconds to fully up…
TRTIS:: Server started with pid: 103
AIAA:: aiaa_port=80
AIAA:: aiaa_log_file=/var/nvidia/aiaa/logs/host-80/aiaa.log
AIAA:: aiaa_log_dir=/var/nvidia/aiaa/logs/host-80
AIAA:: aiaa_workspace=/var/nvidia/aiaa
AIAA:: aiaa_ssl=false
AIAA:: aiaa_ssl_cert_file=/etc/ssl/certs/ssl-cert-snakeoil.pem
AIAA:: aiaa_ssl_pkey_file=/etc/ssl/private/ssl-cert-snakeoil.key
Stopping Apache httpd web server apache2
Site 000-default disabled.
To activate the new configuration, you need to run:
service apache2 reload
Enabling site AIAA.
To activate the new configuration, you need to run:
service apache2 reload
Starting AIAA Server…
AH00558: apache2: Could not reliably determine the server’s fully qualified domain name, using 10.233.116.248. Set the ‘ServerName’ directive globally to suppress this message
I then set up port forwarding with this command:
kubectl port-forward clara-v2-64f4dd4c4-r8jh2 5000:5000 -n syifa
When I open localhost:5000, I get the message that localhost didn’t send any data.
The port-forward terminal shows this error:
E0413 14:59:28.775394 21411 portforward.go:400] an error occurred forwarding 5000 -> 5000: error forwarding port 5000 to pod d04aa4f526d6bb6a44f9640a7088459df3bb1195d431ffd16e6520072fab987e, uid : exit status 1: 2020/04/13 14:59:28 socat[16300] E connect(6, AF=2 127.0.0.1:5000, 16): Connection refused
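One thing that may be worth checking is whether the pod-side port in the port-forward command matches any port the container reports at startup. The startup log lists aiaa_port=80, while the command forwards to 5000. Below is a minimal sketch of that comparison, assuming the log uses the `name_port=value` form shown in this post. The helper names `reported_ports` and `forward_target_port` are made up for illustration and are not part of kubectl or Clara.

```python
import re


def reported_ports(log_text):
    """Return {name: port} for every `<name>_port=<digits>` entry in the log text."""
    return {name: int(value) for name, value in re.findall(r"(\w+_port)=(\d+)", log_text)}


def forward_target_port(command):
    """Return the pod-side port of the first LOCAL:REMOTE pair in the command, or None."""
    match = re.search(r"\b(\d+):(\d+)\b", command)
    return int(match.group(2)) if match else None


# Port lines copied from the startup log in this post.
log_text = """\
TRTIS:: trtis_http_port=8000
TRTIS:: trtis_grpc_port=8001
TRTIS:: trtis_metrics_port=8002
AIAA:: aiaa_port=80
"""
# The port-forward command from this post.
command = "kubectl port-forward clara-v2-64f4dd4c4-r8jh2 5000:5000 -n syifa"

ports = reported_ports(log_text)
target = forward_target_port(command)
print(ports)
print("forward target listed in log:", target in ports.values())
```

With the values from this post the last line prints False, since 5000 is not among the reported ports. That would be consistent with the "Connection refused" on 127.0.0.1:5000, though it does not by itself confirm which port should be forwarded.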
I tried exiting and restarting the pod, but the AIAA server often stops before it finishes starting. I never had this problem with Clara Train SDK v1.0.
Can you help me with my problem? Thank you so much!