Pipeline cannot connect to TensorRT server

I have built a custom container that processes data and sends it to a TensorRT Inference Server for inference. When I run the containers individually, everything works fine. But when I run them as a custom pipeline, the client fails to connect to the TensorRT server.

My pipeline:

api-version: 0.3.0
name: mri-custom-pipeline
operators:
# mri-cardiac-segmentation operator
- name: mri-cardiac
  description: Segmentation of cardiac MRI images
  container:
    image: inference_client_dicom
    tag: Dyad
  input:
  - path: /input
  output:
  - path: /output
    name: segmentation
  services:
  - name: trtis
    # TensorRT Inference Server, required by this AI application.
    container:
      image: nvcr.io/nvidia/tensorrtserver
      tag: 19.10-py3
      command: ["trtserver", "--model-store=/models"]
    requests:
      gpu: 1
    # services::connections defines how the TRTIS service is expected to
    # be accessed. Clara Platform supports network ("http") and
    # volume ("file") connections.
    connections:
      http:
      # The name of the connection is used to populate an environment
      # variable inside the operator's container during execution.
      # This AI application inside the container needs to read this variable to
      # know the IP and port of TRTIS in order to connect to the service.
      - name: TRTISURI
        port: 8000
    # Some services need specialized or minimal set of hardware. In this case
    # NVIDIA Tensor RT Inference Server [TRTIS] requires at least one GPU to function.

Note: I also tried "--model-store=$(NVIDIA_CLARA_SERVICE_DATA_PATH)/models"

Error:

INFO:__main__:Program started.
*   Trying 10.109.143.215:8000...
* TCP_NODELAY set
* connect to 10.109.143.215 port 8000 failed: Connection refused
* Failed to connect to 10.109.143.215 port 8000: Connection refused
* Closing connection 0
TRTIS URI : 10.109.143.215:8000
RuntimeEnv Input : /input
RuntimeEnv Output : /output
RuntimeEnv Logs : /logs
[ 0] HTTP client failed: Couldn't connect to server
Error! [ 0] HTTP client failed: Couldn't connect to server
Press enter to exit (and fix the problem)
Traceback (most recent call last):
  File "inference_client_dicom/main.py", line 94, in <module>
    runtime_env)
  File "/app/inference_client_dicom/dyad_client.py", line 79, in __init__
    self.batch_size, self.verbose)
  File "/app/inference_client_dicom/dyad_client.py", line 302, in parse_model
    server_status = ctx.get_server_status()
  File "/root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.5/site-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
    self._ctx, byref(cstatus), byref(cstatus_len))))
  File "/root/.local/share/virtualenvs/app-4PlAip0Q/lib/python3.5/site-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
    raise ex
tensorrtserver.api.InferenceServerException: [ 0] HTTP client failed: Couldn't connect to server

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference_client_dicom/main.py", line 105, in <module>
    input()
EOFError: EOF when reading a line

The inference_client_dicom container receives the TRTIS IP address and port, but nothing is accepting connections at that address.
Please help.

Thanks for the question.

First, I noticed that the TensorRT Inference Server container version is specified as "tag: 19.10-py3". The Clara Deploy R4 package contains TRTIS container version 19.08. Please make sure the specified container version is in the local repo or can be pulled. Given that the custom containers run fine outside the pipeline, I would expect the container image to be present.

Second, the model store path needs to be "--model-store=$(NVIDIA_CLARA_SERVICE_DATA_PATH)/models" in the pipeline definition.

Third, the custom model, in fact the whole folder of the custom model, must be copied over to the model store folder named above. The model store physical path by default is /clara/common/models.
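To double-check the third point, it can help to list what the server will find in the model store. The sketch below is only an illustration: `check_model_store` is a hypothetical helper (not part of Clara Deploy), and it assumes the usual TRTIS repository layout of one folder per model containing a config.pbtxt and at least one numeric version subfolder.

```python
from pathlib import Path


def check_model_store(store_path):
    """Return {model_name: [problems]} for each folder in the model store.

    Hypothetical helper: assumes the usual TRTIS layout of
    <store>/<model>/config.pbtxt plus <store>/<model>/<version>/ folders.
    """
    store = Path(store_path)
    if not store.is_dir():
        return {"<store>": ["model store %s does not exist" % store]}
    report = {}
    for model_dir in sorted(p for p in store.iterdir() if p.is_dir()):
        problems = []
        if not (model_dir / "config.pbtxt").is_file():
            problems.append("missing config.pbtxt")
        versions = [p for p in model_dir.iterdir()
                    if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append("no numeric version folder")
        report[model_dir.name] = problems
    return report


if __name__ == "__main__":
    # /clara/common/models is the default physical path named above.
    for name, problems in check_model_store("/clara/common/models").items():
        print(name, "OK" if not problems else "; ".join(problems))
```

An empty problem list for your model's folder suggests the layout is as expected; it does not prove the server can load the model.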

Fourth, the following are some useful commands for checking the status of TensorRT Inference Server while and after running a pipeline that has a TRTIS dependency.

kubectl get pods | grep trtis

This will show the TRTIS pod ID and status.

kubectl get svc | grep trtis

This will show the TRTIS server IP and port.

If the above commands show a running TRTIS service IP and the default port 8000, query the server status with this command:
curl http://<TRTIS IP>:8000/api/status
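The same status query can also be scripted from inside the client container, with a few retries so that the result is not decided by a single attempt. This is a minimal sketch, not the tensorrtserver client API: it assumes the connection variable is named TRTISURI (as in the pipeline posted above) and holds a host:port value, and `status_url`, `http_get`, and `wait_for_trtis` are hypothetical helper names.

```python
import os
import time
import urllib.error
import urllib.request


def status_url(trtis_uri):
    """Build the status URL from a 'host:port' value such as TRTISURI."""
    return "http://%s/api/status" % trtis_uri.strip()


def http_get(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # e.g. "Connection refused"


def wait_for_trtis(trtis_uri, attempts=30, delay=2.0,
                   get=http_get, sleep=time.sleep):
    """Poll the status endpoint until it answers 200 or attempts run out."""
    url = status_url(trtis_uri)
    for _ in range(attempts):
        if get(url) == 200:
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    # TRTISURI is assumed to be the connection name from the pipeline definition.
    uri = os.environ.get("TRTISURI", "")
    if not uri:
        print("TRTISURI is not set")
    else:
        print("TRTIS ready:", wait_for_trtis(uri))
```

If this keeps returning False while kubectl shows the pod as running, the problem is more likely in the server configuration or the network than in the client's timing.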

Best Regards

Hi haohis66,

If curl http://<TRTIS IP>:8000/api/status (mentioned above by mingmelvinq) doesn't return anything meaningful (it hangs or the connection fails), here are some other things you can check:

1) Check if DNS is working inside kubernetes container

If any of the following commands fail, please let us know and include the output.

$ kubectl run -it --rm test-network --image=ubuntu:18.04 --restart=Never -- bash

root@test-network:/# apt-get update

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [21.8 kB]

root@test-network:/# apt-get install -y curl

Reading package lists... Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:

root@test-network:/# curl http://www.google.com

<!doctype html><meta content=

root@test-network:/# cat /etc/resolv.conf

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

root@test-network:/# exit
exit
pod "test-network" deleted
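If the test image has Python 3 available, the resolv.conf and name-resolution checks from the session above can be done with the standard library alone. This is a rough sketch under that assumption; `parse_resolv_conf`, `read_resolv_conf`, and `can_resolve` are hypothetical helper names, and hostnames to test are passed on the command line.

```python
import socket
import sys


def parse_resolv_conf(text):
    """Extract nameservers and search domains from resolv.conf content."""
    nameservers, search = [], []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) < 2:
            continue
        if fields[0] == "nameserver":
            nameservers.append(fields[1])
        elif fields[0] == "search":
            search = fields[1:]
    return nameservers, search


def read_resolv_conf(path="/etc/resolv.conf"):
    """Parse the file at path; return empty lists if it cannot be read."""
    try:
        with open(path) as handle:
            return parse_resolv_conf(handle.read())
    except OSError:
        return [], []


def can_resolve(hostname):
    """Return True if hostname resolves through the container's DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    servers, domains = read_resolv_conf()
    print("nameservers:", servers)
    print("search domains:", domains)
    # Example: python3 dnscheck.py www.google.com
    for host in sys.argv[1:]:
        print(host, "resolves:", can_resolve(host))
```

In the session above, the expected result inside the pod is the nameserver 10.96.0.10 and the three cluster.local search domains.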

2) Check what is managing DNS (make sure dnsmasq is not installed)

Please check /etc/resolv.conf

$ ls -al /etc/resolv.conf

lrwxrwxrwx 1 root root 32 Aug 26 15:21 /etc/resolv.conf -> /run/systemd/resolve/resolv.conf

$ cat /etc/resolv.conf

nameserver xxx.xxx.xxx.xxx
nameserver xxx.xxx.xxx.xxx

$ cat /etc/NetworkManager/NetworkManager.conf

[main]
plugins=ifupdown,keyfile

[ifupdown]
managed=false

[device]
wifi.scan-rand-mac-address=no

If /etc/NetworkManager/NetworkManager.conf contains the statement 'dns=dnsmasq', or if the output of the command 'sudo ss -pln sport = 53' mentions dnsmasq, then the DNS service is controlled by dnsmasq.
In that case, please remove 'dns=dnsmasq' from the file and restart the network:

$ sudo service network-manager restart
$ sudo service networking restart
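The NetworkManager.conf part of this check can be automated. The sketch below assumes the file is INI-style, like the sample shown above, and only looks for a dns=dnsmasq entry in the [main] section; `uses_dnsmasq` is a hypothetical helper name, and the script does not replace the 'ss' check on port 53.

```python
import configparser


def uses_dnsmasq(conf_text):
    """Return True if the [main] section sets dns=dnsmasq."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    value = parser.get("main", "dns", fallback="")
    return value.strip().lower() == "dnsmasq"


if __name__ == "__main__":
    path = "/etc/NetworkManager/NetworkManager.conf"
    try:
        with open(path) as handle:
            print("dnsmasq enabled:", uses_dnsmasq(handle.read()))
    except OSError as err:
        print("could not read %s: %s" % (path, err))
```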

3) Check whether institution-wide proxy server settings are preventing operators from reaching the internet or the Clara service IPs

If the internet connection is provided through an HTTP proxy server and Docker is not configured to use it, Docker containers cannot access the internet while building images or running.
Even if the proxies for Docker are set up properly, Clara operators may still be unable to reach other Kubernetes services such as TRTIS. Kubernetes uses a specific service [CIDR](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing) (default '10.96.0.0/12'), and Clara Deploy is set up to use '10.244.0.0/16' as the pod network CIDR of the Kubernetes node (defined in the install-prereqs.sh script). Requests to those addresses are sent to the proxy unless they are excluded, so Docker needs to be configured not to use proxies for those IP ranges.

To address these issues, add or update the proxies key in the ~/.docker/config.json file (create the file if it doesn't exist) as shown below, assuming the proxy server's address is http://proxy.xxxx.edu:8080. See https://docs.docker.com/network/proxy/ for detailed information:

{
    "proxies": {
        "default":
        {
            "httpProxy": "http://proxy.xxxx.edu:8080",
            "httpsProxy": "http://proxy.xxxx.edu:8080",
            "noProxy": "127.0.0.1,10.96.0.0/12,10.244.0.0/16"
        }
    }
}
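To see why these two ranges matter, you can check whether a given service address falls inside the noProxy list. The sketch below only does the CIDR arithmetic with Python's ipaddress module; `bypasses_proxy` is a hypothetical helper, and whether a particular client actually honors CIDR entries in its no-proxy setting depends on that client.

```python
import ipaddress


def bypasses_proxy(address, no_proxy):
    """Return True if address matches an IP or CIDR entry in a noProxy string."""
    ip = ipaddress.ip_address(address)
    for entry in no_proxy.split(","):
        entry = entry.strip()
        if not entry:
            continue
        try:
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            continue  # hostnames and domain suffixes are not handled here
        if ip in network:
            return True
    return False


if __name__ == "__main__":
    no_proxy = "127.0.0.1,10.96.0.0/12,10.244.0.0/16"
    # 10.109.143.215 is the TRTIS address from the error log in the question.
    print(bypasses_proxy("10.109.143.215", no_proxy))
```

The TRTIS address from the question, 10.109.143.215, lies inside 10.96.0.0/12 (which spans 10.96.0.0 through 10.111.255.255), so it is covered by the noProxy value above.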