I’ve been using the TAO API deployed to AWS EKS for a while, and I’m now seeing some strange server-side behavior: the login endpoint works, but all subsequent requests (experiments, datasets, etc.) fail.
For example:
GET https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets 500 (Internal Server Error)
After authenticating with the /api/v1/login endpoint, I get 500 (Internal Server Error) for all other fetch() API calls (GET, POST, etc.).
Based on the logs, it seems the user_id and token are returned from the /login endpoint just fine, but then the token fails to validate in TAO’s server-side code.
When I replace the API key with a freshly generated one, everything works fine.
My questions:
- Why does this happen? I don’t believe we set any expiration when creating the key; we simply visited https://org.ngc.nvidia.com/setup/api-key and hit “Generate API key”.
- The ‘stale’ API key had some important user data attached to it. Is there any way to “revive” that API key and keep using it for auth?
Also, I didn’t change anything in my authentication/login code; this issue began suddenly on Friday while testing various app endpoints.
Full network request, for reference:
Request: https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets
Request headers:
:authority: tao-api.local
:method: GET
:path: /tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br, zstd
accept-language: en-US,en;q=0.9
content-type: application/json
origin: http://localhost:5173
priority: u=1, i
referer: http://localhost:5173/
sec-ch-ua: "Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36
Response headers:
access-control-allow-credentials: true
access-control-allow-headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
access-control-allow-methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
access-control-allow-origin: *
access-control-max-age: 3600
content-length: 572
content-type: text/html
date: Sun, 20 Oct 2024 22:48:24 GMT
strict-transport-security: max-age=31536000; includeSubDomains
Response status: 500
Here are kubectl logs from the API pod:
10.150.2.80 - - [21/Oct/2024:19:28:35 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:35 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:45 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:45 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:55 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:55 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:29:05 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:29:05 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
URL: https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets
Method: GET
Token: ...qCxuQ5COPQ
New session for user: b835bea2-c296-5e58-8958-5995ae3c3a3a
[2024-10-21 19:29:13,191] ERROR in app: Exception on /api/v1/auth [GET]
Traceback (most recent call last):
File "/venv/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
response = self.full_dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
rv = self.handle_user_exception(e)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
rv = self.dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/api/app.py", line 85, in decorated_function
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/api/app.py", line 447, in auth
user_id, err = authentication.validate(token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/api/auth_utils/authentication.py", line 71, in validate
session.set(user_id, token, extra_user_metadata)
File "/opt/api/auth_utils/session.py", line 33, in _wrap
return wrapped(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/api/auth_utils/session.py", line 46, in set
tmp_session = json.load(infile)
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 5434 column 2 (char 1117511)
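For reference, here is a minimal Python sketch of how this class of error arises in general. It does not use TAO's actual session file or code: the `corrupted` string and the `split_first_document` helper are hypothetical, made up for illustration. `json.load` raises "Extra data" when the input starts with one complete JSON document and is followed by more non-whitespace text, which would be consistent with (but does not prove) a session file that ended up with leftover or doubly written content.

```python
import json

# Hypothetical session-file content (not taken from the real file): a complete
# JSON object followed by one stray closing brace.
corrupted = '{\n  "user-1": {"token": "abc"}\n}}'


def split_first_document(text):
    """Return (first JSON value, leftover text) using JSONDecoder.raw_decode."""
    stripped = text.lstrip()  # raw_decode does not skip leading whitespace
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:].strip()


# json.loads parses the first document, then rejects whatever follows it.
try:
    json.loads(corrupted)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# raw_decode stops after the first document, so the leftover can be inspected.
value, leftover = split_first_document(corrupted)

print(error)
print(value, repr(leftover))
```

Running the same kind of check against a copy of the real session file (wherever `session.py` reads it from) would show what sits after character 1117511, and whether the first document is still intact.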
Thank you!!