Tao-toolkit-api-app-pod restarts and fails to change permissions

Hello everyone, we are running the TAO Toolkit API deployed into an EKS cluster using the script provided by NVIDIA in this tutorial: Setup - NVIDIA Docs

When the tao-toolkit-api-app-pod restarts for any reason, it attempts to change file permissions and fails, after which all TAO Toolkit API operations fail. There is no other output, and I cannot find anything in the docs explaining why this happens. This is becoming a real problem because the only fix I have found is to uninstall the Helm chart, wipe the PVC (losing all current data), and then reinstall with Helm.

I was able to exec into the pod and check the permissions, and they seem correct.

root@tao-toolkit-api-app-pod-59b6f6c6cc-kq557:/opt/api# ls -l /shared
total 8
-rw-rw-rw- 1 nobody nogroup  422 Mar 30 12:20 health.txt
drwxrwxrwx 4 nobody nogroup 4096 Mar 28 13:59 users

Our EKS cluster runs on g4dn EC2 instances, which means the GPUs are Tesla T4s. The rest of the cluster works fine, with no issues.

Here is the beginning of the logs. They go on like this for a while, and then the pod acts as if everything is normal.

NGC CLI 3.10.0
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/c6786a1c-a733-499f-bb00-95667a557003/pretrained_detectnet_v2_vgooglenet': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/c6786a1c-a733-499f-bb00-95667a557003/pretrained_detectnet_v2_vgooglenet/googlenet.hdf5': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1/yolov4_tiny_ccpd_trainable.tlt': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1/yolov4_tiny_usa_trainable.tlt': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/7aa692c7-2d47-43b8-aa9a-3120993e1300/pretrained_object_detection_vdarknet53': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/7aa692c7-2d47-43b8-aa9a-3120993e1300/pretrained_object_detection_vdarknet53/darknet_53.hdf5': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/logs': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/jobs.yaml': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/specs': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/specs/convert.json': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/jobs_metadata': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_2293_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1396_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_556_0_augmented_224296389824903_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_382_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_541_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1740_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_313_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1354_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_2144_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1473_0_augmented_251934180809220_.txt': Operation not permitted
...
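
A log this long is easier to reason about when summarized per user directory. The following is a minimal sketch, not part of the TAO Toolkit API: it assumes the pod log has been saved as text and that every failure line has exactly the `chmod: changing permissions of '...': Operation not permitted` shape shown above. The function name `summarize_chmod_failures` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   chmod: changing permissions of '/shared/users/<user-id>/...': Operation not permitted
CHMOD_ERROR = re.compile(
    r"^chmod: changing permissions of "
    r"'(?P<path>/shared/users/(?P<user>[^/']+)[^']*)': "
    r"Operation not permitted$"
)


def summarize_chmod_failures(log_text: str) -> Counter:
    """Count 'Operation not permitted' chmod failures per user directory."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHMOD_ERROR.match(line.strip())
        if match:
            counts[match.group("user")] += 1
    return counts
```

Calling `summarize_chmod_failures(log_text).most_common()` on the saved log would show whether the failures are confined to a few user IDs or cover everything under /shared/users.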

Then, of course, when my ML team tries to make requests they get 500 errors.

[2023-03-30 12:19:52,518] ERROR in app: Exception on /api/v1/user/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/model [GET]
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/app.py", line 1424, in model_list
    response = make_response(jsonify(schema.dump(schema.load(metadata))['models']))
                                                 ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/marshmallow/schema.py", line 722, in load
    return self._do_load(
           ^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/marshmallow/schema.py", line 909, in _do_load
    raise exc
marshmallow.exceptions.ValidationError: {'models': {6: {'train_datasets': ['Not a valid list.']}, 7: {'train_datasets': ['Not a valid list.']}, 8: {'train_datasets': ['Not a valid list.']}}}
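
The traceback shows that `schema.load(metadata)` rejected the `train_datasets` field of the models at indices 6, 7, and 8. A quick way to see what those entries actually contain is sketched below. It is an approximation, not the API's own validation: it assumes `metadata` is a dict with a `models` list of dicts (which is what the indexed error message suggests), and the function name `find_invalid_train_datasets` is hypothetical.

```python
_MISSING = object()  # sentinel so a missing key is not confused with None


def find_invalid_train_datasets(metadata: dict) -> dict:
    """Return {index: value} for models whose 'train_datasets' is present but not a list."""
    invalid = {}
    for index, model in enumerate(metadata.get("models", [])):
        value = model.get("train_datasets", _MISSING)
        if value is not _MISSING and not isinstance(value, list):
            invalid[index] = value
    return invalid
```

If the returned values turn out to be empty strings or half-written entries, that would point at incomplete metadata on the shared volume rather than at the request itself.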

Please change the ownership and mode of the folders on your host and try restarting the pod again. Please follow these steps.

  1. Go to the NFS folder, such as
    cd /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-db12515a-fd55-45a1-bfee-a384462b5b77/users

  2. Change the user and group so that sessions.yaml, all PTMs you downloaded, and all files under your user_id folder belong to root, then set their mode to 777
    sudo chown -R root:root sessions.yaml
    sudo chown -R root:root b37c122a-c210-55eb-be86-913f5c6cc406
    sudo chown -R root:root 00000000-0000-0000-0000-000000000000

    sudo chmod -R 777 sessions.yaml
    sudo chmod -R 777 b37c122a-c210-55eb-be86-913f5c6cc406
    sudo chmod -R 777 00000000-0000-0000-0000-000000000000

  3. Delete the crash-looping pod and wait for the new pod to start. It will restart several times, and then the pod will be ready.
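
Before deleting the pod, it can help to confirm that the ownership and mode changes from the steps above actually took effect everywhere. The following is a minimal sketch of such a check, not an NVIDIA-provided tool; the function name `find_mismatched` and its parameters are hypothetical, and it only reads metadata without changing anything.

```python
import os
import stat


def find_mismatched(root, uid=None, gid=None, mode=0o777):
    """List paths under root (root included) whose owner, group, or mode differ.

    Passing uid=None or gid=None skips that check. Symlinks are skipped.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    mismatched = []
    for path in paths:
        info = os.lstat(path)
        if stat.S_ISLNK(info.st_mode):
            continue
        wrong_owner = uid is not None and info.st_uid != uid
        wrong_group = gid is not None and info.st_gid != gid
        wrong_mode = stat.S_IMODE(info.st_mode) != mode
        if wrong_owner or wrong_group or wrong_mode:
            mismatched.append(path)
    return sorted(mismatched)
```

Run from the users folder on the NFS host, `find_mismatched("00000000-0000-0000-0000-000000000000", uid=0, gid=0)` should return an empty list once the chown and chmod commands have succeeded.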

One more question: you mentioned that the pod acts as if it is normal, but you also mentioned that you "get 500 errors." Do you mean
pod → ok
notebook → not ok?

Thanks, Morganh. I tried this, and I can successfully change permissions, but I am not allowed to change ownership of anything. Also, if this is the real solution, can I request a feature that adds an init container to the Helm deployment so that it automatically changes permissions on each startup? As of right now, I had to delete the entire Helm installation, including the PVC (all of the data), in order to start it back up for my team to use.

To answer your question about the 500 errors: the pod starts up and automatically tries to change the permissions of those files but fails, then the Flask server comes up as if everything is OK. When our users make requests to it (say, from a Jupyter notebook), it throws 500 errors. I assume it cannot access the data it needs, but I don't see any easy way to fix it with NVIDIA's current setup.

I will sync with the internal team about your request. Thanks.

Thanks Morganh!
