Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) AMD64, Ubuntu 20.04, TAO 4.0.2 Bare Metal API install
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I did a bare-metal install of the TAO API and everything was working (with adjustments for the known connection issue). After discovering a data problem and needing my GPU, I uninstalled the TAO API to resolve those issues. I have since reinstalled the TAO API on bare metal. The second install went fine (note the 127.0.0.2 host):
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************************************************************************************
127.0.0.2 : ok=25 changed=15 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
A side note and recommendation: add a FIXME to the convert_index cell in object_detection.ipynb. I missed the need to edit the class mappings, and I could have avoided this problem had I configured my job correctly.
if convert_action == "convert_and_index":
    # FIXME - check class mapping
    # Change this to the classes your dataset has
    specs["target_class_mapping"] = [
        {"key": "pedestrian", "value": "pedestrian"},
        {"key": "cyclist", "value": "cyclist"},
        {"key": "car", "value": "car"},
    ]
After the install, I verified the cluster:
hostname -i
127.0.1.1
kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
32080
kubectl get pods
NAME READY STATUS RESTARTS AGE
ingress-nginx-controller-5ff6555d5d-hbbs4 1/1 Running 0 3m
nfs-subdir-external-provisioner-5f9cbb4554-9pfxl 1/1 Running 0 2m55s
nvidia-smi-5950x 1/1 Running 0 2m52s
tao-toolkit-api-app-pod-54c9c75fbc-brrzp 1/1 Running 0 2m50s
tao-toolkit-api-workflow-pod-55b9bfc948-dndxz 1/1 Running 0 2m50s
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller NodePort 10.107.222.63 <none> 80:32080/TCP,443:32443/TCP 3m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 6m49s
tao-toolkit-api-service ClusterIP 10.103.203.163 <none> 8000/TCP 2m50s
Cluster & associated services seem fine.
From my notebook (basically the original, but I changed some print statements to help me troubleshoot):
Setting up Connection
(adjusted per known connection issue)
response = requests.get(f"{host_url}/api/v1/login/{ngc_api_key}")
print (f"response: {response}")
user_id = str(uuid.uuid4())
print (f"HOST generated userid: {user_id}")
token = "whatever"
print (f"token doesn't matter: {token}")
# set base URL
base_url = f"http://127.0.0.1:31951/api/v1/user/{user_id}"
headers = {"Authorization": f"Bearer {token}"}
print (f"API Calls will be forwarded to: {base_url}")
print (f"headers: {headers}")
result
request: http://127.0.1.1:32080/api/v1/login/...NjVl
response: <Response [401]>
HOST generated userid: 4b6fb64c-5a26-4aef-ad3c-650a2d8220fb
token doesn't matter: whatever
API Calls will be forwarded to: http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb
headers: {'Authorization': 'Bearer whatever'}
First TAO API Operation
# Create train dataset
# response 201 == success!
data = json.dumps({"type":ds_type,"format":ds_format})
endpoint = f"{base_url}/dataset"
print (f"endpoint: {endpoint}")
print (f"data: {data}")
print (f"headers: {headers}")
response = requests.post(endpoint,data=data,headers=headers)
print(response)
print(response.json())
dataset_id = response.json()["id"]
print (f'dataset_id: {dataset_id}')
result
endpoint: http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset
data: {"type": "object_detection", "format": "kitti"}
headers: {'Authorization': 'Bearer whatever'}
With 127.0.0.2 (since the install referenced 127.0.0.2):
ConnectionError: HTTPConnectionPool(host='127.0.0.2', port=31951): Max retries exceeded with url: /api/v1/user/e8990b89-013b-42b5-8fe3-15e1f702275d/dataset (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5dcc1d7748>: Failed to establish a new connection: [Errno 111] Connection refused',))
I also tried 127.0.0.1 to match the original instructions, with the same result.
I also tried in a browser:
http://127.0.0.1:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset
http://127.0.0.2:31951/api/v1/user/4b6fb64c-5a26-4aef-ad3c-650a2d8220fb/dataset
Connection refused in both cases.
http://127.0.1.1:32080/api/v1/login/...NjVl
{}
Empty braces (as expected; a good thing?)
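To narrow this down, a plain TCP probe can show which host/port pairs accept connections at all, independent of the notebook and the browser. This is only a sketch: `port_is_open` is a hypothetical helper (not part of TAO), and the host/port pairs are simply the ones that appear in the output above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Host/port pairs taken from the output in this post (assumed candidates).
candidates = [
    ("127.0.1.1", 32080),  # login URL that returned a response
    ("127.0.0.1", 31951),  # base_url as written in the notebook
    ("127.0.0.2", 31951),  # base_url using the address from the install recap
]

if __name__ == "__main__":
    for host, port in candidates:
        state = "open" if port_is_open(host, port) else "refused/unreachable"
        print(f"{host}:{port} -> {state}")
```

A pair that reports "refused/unreachable" here fails before any TAO API code runs, which would separate a port or address problem from an API or authentication problem.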
So basically, my cluster appears okay but I can’t connect. What did I miss? Thanks for all of your help.