NGC API key for TAO - login works, token suddenly fails to authenticate


(Crossposting from TAO forums based on moderator advice).

Hello,

I’ve been using the TAO API deployed to AWS EKS for a while, authenticating with an NGC API key. I’m now seeing some strange server-side behavior: the TAO login endpoint works, but all subsequent endpoints (experiments, datasets, etc.) fail.

For example:

GET https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets 500 (Internal Server Error)
After authenticating with the /api/v1/login endpoint, I get a 500 (Internal Server Error) for every other fetch() API call (GET, POST, etc.).

Based on the logs, it seems the user_id and token are returned from the /login endpoint just fine, but the token then fails to validate in TAO’s server-side code.

When I replace the NGC API key with a new fresh one, everything works fine.

My questions:

  • Why does this happen? I don’t believe we set any expiration when creating the key; we simply visited https://org.ngc.nvidia.com/setup/api-key and hit “Generate API key”.
  • The ‘stale’ API key had some important user data attached to it – is there any way to “revive” that API key & keep using it for auth?

Also, I didn’t change anything in my authentication / login code; this issue began suddenly on Friday while I was testing various app endpoints.

Full network request, for reference:

Request: https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets

Request headers:
:authority: tao-api.local
:method: GET
:path: /tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br, zstd
accept-language: en-US,en;q=0.9
content-type: application/json
origin: http://localhost:5173
priority: u=1, i
referer: http://localhost:5173/
sec-ch-ua: "Google Chrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36

Response headers:
access-control-allow-credentials: true
access-control-allow-headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
access-control-allow-methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
access-control-allow-origin: *
access-control-max-age: 3600
content-length: 572
content-type: text/html
date: Sun, 20 Oct 2024 22:48:24 GMT
strict-transport-security: max-age=31536000; includeSubDomains

Response status: 500 

Here are kubectl logs from the API pod:

10.150.2.80 - - [21/Oct/2024:19:28:35 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:35 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:45 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:45 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:55 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:28:55 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:29:05 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 184 "-" "kube-probe/1.29"
10.150.2.80 - - [21/Oct/2024:19:29:05 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 185 "-" "kube-probe/1.29"
URL: https://tao-api.local/tao-api/api/v1/users/b835bea2-c296-5e58-8958-5995ae3c3a3a/datasets
Method: GET
Token: ...qCxuQ5COPQ
New session for user: b835bea2-c296-5e58-8958-5995ae3c3a3a
[2024-10-21 19:29:13,191] ERROR in app: Exception on /api/v1/auth [GET]
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/app.py", line 85, in decorated_function
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/api/app.py", line 447, in auth
    user_id, err = authentication.validate(token)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/auth_utils/authentication.py", line 71, in validate
    session.set(user_id, token, extra_user_metadata)
  File "/opt/api/auth_utils/session.py", line 33, in _wrap
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/auth_utils/session.py", line 46, in set
    tmp_session = json.load(infile)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 5434 column 2 (char 1117511)
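
For reference, the last frame of the traceback can be reproduced outside TAO. The sketch below is not TAO's code: the file name, the session layout, and the load_session_file helper are all hypothetical, and I'm only assuming that the session file read in session.py ended up with stray characters after a complete JSON document. It shows that json.load raises the same "Extra data" error in that situation.

```python
import json
import os
import tempfile


def load_session_file(path):
    """Load a JSON file and return (data, error) instead of raising.

    Hypothetical helper for illustration only; not part of TAO.
    """
    try:
        with open(path, "r", encoding="utf-8") as infile:
            return json.load(infile), None
    except json.JSONDecodeError as exc:
        return None, exc


# Hypothetical session store: a single JSON object keyed by user id.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sessions.json")
with open(path, "w", encoding="utf-8") as outfile:
    json.dump({"user-a": {"token": "aaa"}}, outfile, indent=2)

# A well-formed file loads without error.
good_data, good_err = load_session_file(path)

# Simulate corruption: stray characters after the end of the document
# (an assumption about what happened, e.g. an interrupted or overlapping write).
with open(path, "a", encoding="utf-8") as outfile:
    outfile.write("\n}")

bad_data, bad_err = load_session_file(path)
print("before corruption:", good_data, good_err)
print("after corruption:", bad_err)
```

If that assumption holds, the failure comes from the stored session file rather than from the key itself, which would also fit with a fresh key working.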

Thank you!!