TAO API - bare metal install - Connection Refused after TAO API re-install

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) AMD64, Ubuntu 20.04, TAO 4.0.2 Bare Metal API install
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I installed the TAO API on bare metal and everything was working (with adjustments for the known connection issue). After discovering a data problem and needing my GPU, I uninstalled the TAO API to resolve those issues. I have since reinstalled the TAO API on bare metal. The second install went fine (note 127.0.0.2):

PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************************************************************************************
127.0.0.2                  : ok=25   changed=15   unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   
localhost                  : ok=10   changed=5    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   

As a side note (a recommendation): add a FIXME to object_detection.ipynb in the convert_index cell. I missed the need to edit the class mappings, and I could have avoided this problem had I configured my job correctly.

if convert_action == "convert_and_index":
    # FIXME - check class mapping
    #Change this to the classes your dataset has
    specs["target_class_mapping"] = [   {"key":"pedestrian","value":"pedestrian"},
                                        {"key":"cyclist","value":"cyclist"},
                                        {"key":"car","value":"car"}]
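
For anyone who wants to avoid typing the mapping by hand, here is a minimal sketch that builds it from the labels themselves. It assumes KITTI-format label lines, where the first whitespace-separated field is the class name. The helper name `build_class_mapping` and the sample label lines are hypothetical and not part of the TAO notebook. It maps every class to itself, which may not be what your job needs.

```python
def build_class_mapping(label_lines):
    """Return identity key/value entries for each class found in KITTI label lines."""
    classes = []
    for line in label_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]  # KITTI: the first field is the object class
        if name not in classes:
            classes.append(name)
    return [{"key": name, "value": name} for name in sorted(classes)]


# Hypothetical label lines, for illustration only
sample = [
    "car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00",
    "pedestrian 0.00 0 0.00 100.00 150.00 130.00 220.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00",
    "car 0.00 0 0.00 300.00 170.00 350.00 210.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00",
]
mapping = build_class_mapping(sample)
print(mapping)
```

The result could then be assigned to specs["target_class_mapping"] after checking it against the classes you actually want to train on.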

After the install, I verified the cluster:

hostname -i
127.0.1.1

kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
32080

kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-5ff6555d5d-hbbs4          1/1     Running   0          3m
nfs-subdir-external-provisioner-5f9cbb4554-9pfxl   1/1     Running   0          2m55s
nvidia-smi-5950x                                   1/1     Running   0          2m52s
tao-toolkit-api-app-pod-54c9c75fbc-brrzp           1/1     Running   0          2m50s
tao-toolkit-api-workflow-pod-55b9bfc948-dndxz      1/1     Running   0          2m50s

kubectl get services
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   NodePort    10.107.222.63    <none>        80:32080/TCP,443:32443/TCP   3m
kubernetes                 ClusterIP   10.96.0.1        <none>        443/TCP                      6m49s
tao-toolkit-api-service    ClusterIP   10.103.203.163   <none>        8000/TCP                     2m50s

Cluster & associated services seem fine.
From my notebook (basically the original, but I changed some print statements to help me troubleshoot):

Setting up Connection

(adjusted per known connection issue)

response = requests.get(f"{host_url}/api/v1/login/{ngc_api_key}")
print (f"response: {response}")
user_id = str(uuid.uuid4())
print (f"HOST generated userid: {user_id}")
token = "whatever"
print (f"token doesn't matter: {token}")

# set base URL
base_url = f"http://127.0.0.1:31951/api/v1/user/{user_id}"
headers = {"Authorization": f"Bearer {token}"}
print (f"API Calls will be forwarded to: {base_url}")
print (f"headers: {headers}")

result

request: http://127.0.1.1:32080/api/v1/login/...NjVl

response: <Response [401]>
HOST generated userid: 4b6fb64c-5a26-4aef-ad3c-650a2d8220fb
token doesn't matter: whatever

API Calls will be forwarded to: http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb

headers: {'Authorization': 'Bearer whatever'}

First TAO API Operation

# Create train dataset
# response 201 == success!
data = json.dumps({"type":ds_type,"format":ds_format})
endpoint = f"{base_url}/dataset"

print (f"endpoint: {endpoint}")
print (f"data: {data}")
print (f"headers: {headers}")

response = requests.post(endpoint,data=data,headers=headers)
print(response)
print(response.json())
dataset_id = response.json()["id"]
print (f'dataset_id: {dataset_id}')

result

endpoint: http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset
data: {"type": "object_detection", "format": "kitti"}
headers: {'Authorization': 'Bearer whatever'}

With 127.0.0.2 (since the install referenced 127.0.0.2):
ConnectionError: HTTPConnectionPool(host='127.0.0.2', port=31951): Max retries exceeded with url: /api/v1/user/e8990b89-013b-42b5-8fe3-15e1f702275d/dataset (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5dcc1d7748>: Failed to establish a new connection: [Errno 111] Connection refused',))

I also tried 127.0.0.1 to match the original instructions, with the same result.

I also tried in a browser:

http://127.0.0.1:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset
http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset

Connection refused in both cases.

http://127.0.1.1:32080/api/v1/login/...NjVl
{}

Empty braces (as expected; a good thing?)
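
A quick way to tell a refused TCP connection apart from an HTTP-level error such as the 401 above is to probe the port directly. This is a minimal sketch using only the standard library. The function name `port_accepts_connections` is hypothetical, and the host/port pairs are the ones that appear in this thread.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value
        # (for example 111, ECONNREFUSED, on Linux) on failure
        return sock.connect_ex((host, port)) == 0


# Host/port pairs taken from this thread
for host, port in [("127.0.1.1", 32080), ("127.0.0.1", 31951), ("127.0.0.2", 31951)]:
    print(f"{host}:{port} open: {port_accepts_connections(host, port)}")
```

If the ingress port answers but the API port does not, nothing is listening on that port on the node, which points at the service definition rather than at the notebook.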

So basically, my cluster appears okay but I can’t connect. What did I miss? Thanks for all of your help.

This is expected, as mentioned in the topic "Tao Toolkit API cannot login and got 401 unauthorized".

Could you double-check the steps mentioned in the workaround topic above?

Thanks again for your help on this.
I think I may have forgotten to edit the service per the workaround after the re-install…

I should have re-verified:

kubectl get services
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   NodePort    10.111.187.213   <none>        80:32080/TCP,443:32443/TCP   75m
kubernetes                 ClusterIP   10.96.0.1        <none>        443/TCP                      79m
tao-toolkit-api-service    NodePort    10.103.208.44    <none>        8000:31951/TCP               75m

I definitely transposed two port digits. That is fixed, and I'm now connected correctly.
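
The two `kubectl get services` listings in this thread differ in one important way: tao-toolkit-api-service was ClusterIP (8000/TCP) right after the re-install and NodePort (8000:31951/TCP) once the workaround was applied. Here is a minimal sketch that reads the node port out of that text output instead of retyping it. The function name `node_ports` is hypothetical, and it assumes the default column layout shown above.

```python
def node_ports(kubectl_output, service_name):
    """Map port -> nodePort for one service from `kubectl get services` text output."""
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != service_name:
            continue
        mapping = {}
        for entry in fields[4].split(","):  # e.g. "8000:31951/TCP" or "8000/TCP"
            ports = entry.split("/")[0]
            if ":" in ports:  # only NodePort entries have the "port:nodePort" form
                port, node_port = ports.split(":")
                mapping[int(port)] = int(node_port)
        return mapping
    return {}


# Lines copied from the two listings in this thread
before = "tao-toolkit-api-service    ClusterIP   10.103.203.163   <none>        8000/TCP                     2m50s"
after = "tao-toolkit-api-service    NodePort    10.103.208.44    <none>        8000:31951/TCP               75m"

print(node_ports(before, "tao-toolkit-api-service"))  # {}
print(node_ports(after, "tao-toolkit-api-service"))   # {8000: 31951}
```

An empty result means the service is not reachable from outside the cluster through a node port, which matches the connection refused error on 31951.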

I also should have noted for the next person:
When I reinstalled (bash setup.sh install), the console referenced 127.0.0.2
(the first install referenced 127.0.0.1).

Consequently, I changed base_url = f"http://127.0.0.2:31951/api/v1/user/{user_id}"
(not 127.0.0.1), and this worked fine.
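
Since both the host and the port changed between my installs, here is a minimal sketch that keeps them in variables rather than hard-coding them in the URL string. The helper name `make_base_url` is hypothetical. The range check assumes the default Kubernetes NodePort range (30000-32767) and only catches values outside that range, not every typo.

```python
import uuid


def make_base_url(host, node_port, user_id):
    """Build the per-user API base URL from the host and node port."""
    if not 30000 <= node_port <= 32767:
        # Assumes the default NodePort range; adjust if your cluster overrides it
        raise ValueError(f"{node_port} is outside the default NodePort range 30000-32767")
    return f"http://{host}:{node_port}/api/v1/user/{user_id}"


user_id = str(uuid.uuid4())
# Values from my second install; yours may differ
base_url = make_base_url("127.0.0.2", 31951, user_id)
print(f"API calls will be forwarded to: {base_url}")
```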

Thanks for the info. Glad to know it is working now.
