Connection Refused: AutoML TAO Toolkit API

I installed the AutoML TAO Toolkit API by following the bare-metal setup instructions in the documentation. I am trying to run the classification notebook below to run AutoML experiments:

tao-getting-started_v5.2.0/notebooks/tao_api_starter_kit/api/classification.ipynb

While executing the cell below:

# Exchange NGC_API_KEY for JWT
data = json.dumps({"ngc_api_key": ngc_api_key})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "user_id" in response.json().keys()
user_id = response.json()["user_id"]
print("User ID",user_id)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/user/{user_id}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}
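
The bare assert statements in this cell only come into play once the server answers, and they give no detail when they fail. As a minimal sketch, the same checks can be done by a small helper that reports which field was missing. `parse_login_payload` is a hypothetical name, not part of the TAO notebook; the `user_id` and `token` fields and the URL layout are taken from the cell above.

```python
def parse_login_payload(host_url, payload):
    """Validate a /api/v1/login JSON payload and derive the values the notebook uses.

    Hypothetical helper for illustration; it mirrors the asserts in the cell above.
    """
    # Report every missing field at once instead of failing on a bare assert.
    missing = [key for key in ("user_id", "token") if key not in payload]
    if missing:
        raise ValueError(f"login response is missing fields: {missing}")

    user_id = payload["user_id"]
    token = payload["token"]

    # Same base URL and Authorization header the notebook builds.
    base_url = f"{host_url}/api/v1/user/{user_id}"
    headers = {"Authorization": f"Bearer {token}"}
    return user_id, base_url, headers
```

In the notebook this would be called as `parse_login_payload(host_url, response.json())` after the status code has been checked.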

I get the following error:

---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
File /usr/lib/python3/dist-packages/urllib3/connection.py:159, in HTTPConnection._new_conn(self)
    158 try:
--> 159     conn = connection.create_connection(
    160         (self._dns_host, self.port), self.timeout, **extra_kw
    161     )
    163 except SocketTimeout:

File /usr/lib/python3/dist-packages/urllib3/util/connection.py:84, in create_connection(address, timeout, source_address, socket_options)
     83 if err is not None:
---> 84     raise err
     86 raise socket.error("getaddrinfo returns an empty list")

File /usr/lib/python3/dist-packages/urllib3/util/connection.py:74, in create_connection(address, timeout, source_address, socket_options)
     73     sock.bind(source_address)
---> 74 sock.connect(sa)
     75 return sock

ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:666, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    665 # Make the request on the httplib connection object.
--> 666 httplib_response = self._make_request(
    667     conn,
    668     method,
    669     url,
    670     timeout=timeout_obj,
    671     body=body,
    672     headers=headers,
    673     chunked=chunked,
    674 )
    676 # If we're going to release the connection in ``finally:``, then
    677 # the response doesn't need to know about the connection. Otherwise
    678 # it will also try to release it and we'll have a double-release
    679 # mess.

File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:388, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    387 else:
--> 388     conn.request(method, url, **httplib_request_kw)
    390 # Reset the timeout for the recv() on the socket

File /usr/lib/python3.8/http/client.py:1256, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1255 """Send a complete request to the server."""
-> 1256 self._send_request(method, url, body, headers, encode_chunked)

File /usr/lib/python3.8/http/client.py:1302, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1301     body = _encode(body, 'body')
-> 1302 self.endheaders(body, encode_chunked=encode_chunked)

File /usr/lib/python3.8/http/client.py:1251, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1250     raise CannotSendHeader()
-> 1251 self._send_output(message_body, encode_chunked=encode_chunked)

File /usr/lib/python3.8/http/client.py:1011, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1010 del self._buffer[:]
-> 1011 self.send(msg)
   1013 if message_body is not None:
   1014 
   1015     # create a consistent interface to message_body

File /usr/lib/python3.8/http/client.py:951, in HTTPConnection.send(self, data)
    950 if self.auto_open:
--> 951     self.connect()
    952 else:

File /usr/lib/python3/dist-packages/urllib3/connection.py:187, in HTTPConnection.connect(self)
    186 def connect(self):
--> 187     conn = self._new_conn()
    188     self._prepare_conn(conn)

File /usr/lib/python3/dist-packages/urllib3/connection.py:171, in HTTPConnection._new_conn(self)
    170 except SocketError as e:
--> 171     raise NewConnectionError(
    172         self, "Failed to establish a new connection: %s" % e
    173     )
    175 return conn

NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f90643c28e0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File /usr/local/lib/python3.8/dist-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:720, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    718     e = ProtocolError("Connection aborted.", e)
--> 720 retries = retries.increment(
    721     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    722 )
    723 retries.sleep()

File /usr/lib/python3/dist-packages/urllib3/util/retry.py:436, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    435 if new_retry.is_exhausted():
--> 436     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    438 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='172.x.x.x', port=32080): Max retries exceeded with url: /api/v1/login (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90643c28e0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[6], line 3
      1 # Exchange NGC_API_KEY for JWT
      2 data = json.dumps({"ngc_api_key": ngc_api_key})
----> 3 response = requests.post(f"{host_url}/api/v1/login", data=data)
      4 assert response.status_code in (200, 201)
      5 assert "user_id" in response.json().keys()

File /usr/local/lib/python3.8/dist-packages/requests/api.py:115, in post(url, data, json, **kwargs)
    103 def post(url, data=None, json=None, **kwargs):
    104     r"""Sends a POST request.
    105 
    106     :param url: URL for the new :class:`Request` object.
   (...)
    112     :rtype: requests.Response
    113     """
--> 115     return request("post", url, data=data, json=json, **kwargs)

File /usr/local/lib/python3.8/dist-packages/requests/api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File /usr/local/lib/python3.8/dist-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File /usr/local/lib/python3.8/dist-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File /usr/local/lib/python3.8/dist-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    515     if isinstance(e.reason, _SSLError):
    516         # This branch is for urllib3 v1.22 and later.
    517         raise SSLError(e, request=request)
--> 519     raise ConnectionError(e, request=request)
    521 except ClosedPoolError as e:
    522     raise ConnectionError(e, request=request)

ConnectionError: HTTPConnectionPool(host='172.x.x.x', port=32080): Max retries exceeded with url: /api/v1/login (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90643c28e0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Following this thread, I tried to run the command below:

$ kubectl edit services tao-toolkit-api-service

but I get the error below:

Unable to connect to the server: dial tcp 172.x.x.x:6443: i/o timeout
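
The two failures are different in kind: the notebook's request to port 32080 was refused (Errno 111, which typically means the host was reached but no process was listening on that port), while kubectl timed out trying to reach port 6443. A minimal sketch for telling these cases apart from Python, using only the standard library. `probe_port` is a hypothetical helper, and the masked address from the logs has to be replaced with the real host.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout', or 'error: ...' for one TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but nothing accepted the connection on this port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No reply at all within the timeout (host down, filtered, or wrong address).
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Calling `probe_port(host, 32080)` and `probe_port(host, 6443)` with the real address shows whether each port is open, refused, or unreachable, independent of requests and kubectl.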

Would really appreciate any help regarding this issue. Thanks.

Did you set up the TAO API environment successfully?

Yes @Morganh. I set up the environment successfully (bare-metal server) and was able to train models with the AutoML TAO API some time ago. I turned off the instance after completing the experiments. I have now turned the instance on again and tried to use the same notebook for training, but I am hitting the connection refused issue. After turning the instance back on, do we need to run any commands to bring the API services up?
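
Whether a command is needed after a restart is not settled in this thread. One thing that can be checked from the notebook side is whether the API port starts accepting connections again once the instance is up. A minimal sketch that polls the port with the standard library; `wait_for_port` is a hypothetical helper and the default timings are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, total_wait=120.0, interval=5.0):
    """Poll until a TCP connection to (host, port) succeeds or total_wait elapses."""
    deadline = time.monotonic() + total_wait
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, timed out, or unreachable: the service is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

If this returns False for port 32080, the login cell will keep failing no matter how often it is rerun, and the problem lies with the services on the server rather than with the notebook.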

Thanks for the info. So the previous run was successful.

Please rerun the cells at the beginning of the notebook.

Yes, I am running the cells in the notebook from the beginning and getting the connection refused error at the cell below.

# Exchange NGC_API_KEY for JWT
data = json.dumps({"ngc_api_key": ngc_api_key})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "user_id" in response.json().keys()
user_id = response.json()["user_id"]
print("User ID",user_id)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/user/{user_id}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

Can you share the full log?
Also, could you terminate the running notebook and start it again to check whether you can reproduce the first successful run?