Hello everyone, we are running the TAO Toolkit API deployed into an EKS cluster using the script provided by NVIDIA in this tutorial: Setup - NVIDIA Docs.
Whenever the tao-toolkit-api-app-pod restarts, for whatever reason, it tries to change permissions on the files under /shared and fails, after which every TAO Toolkit API operation fails. There is no other output, and I cannot find anything in the docs explaining why this happens. This is becoming a real problem because the only fix I have found is to uninstall the Helm chart, wipe the PVC (losing all current data), and reinstall with Helm.
I was able to exec into the pod and check the permissions, and they seem correct:
root@tao-toolkit-api-app-pod-59b6f6c6cc-kq557:/opt/api# ls -l /shared
total 8
-rw-rw-rw- 1 nobody nogroup 422 Mar 30 12:20 health.txt
drwxrwxrwx 4 nobody nogroup 4096 Mar 28 13:59 users
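Everything in the listing is owned by nobody:nogroup. On Linux, chmod returns "Operation not permitted" when the caller neither owns the file nor has privilege over it on that filesystem, and root inside a pod can still hit this on storage that maps root to an unprivileged user. Whether that applies to this PVC is an assumption, not something the logs confirm. A minimal Python sketch for checking which entries are owned by a different UID than the running process (the function name is hypothetical, and it only reads metadata):

```python
import os
import stat


def entries_not_owned_by(root, uid):
    """Return (path, owner_uid, mode) for every entry under root not owned by uid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: report symlinks themselves, not their targets
            if info.st_uid != uid:
                mismatched.append((path, info.st_uid, stat.filemode(info.st_mode)))
    return mismatched
```

Run inside the pod as entries_not_owned_by("/shared", os.geteuid()), this would list the paths an unprivileged chmod could not touch. An empty result alongside the same chmod errors would point at the storage layer rather than plain ownership.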
Our EKS cluster runs on g4dn EC2 instances, which means the GPUs are Tesla T4s. The rest of the cluster works fine, with no other issues.
Here is the beginning of the logs. The chmod errors go on for a while, and then the pod carries on as if everything were normal.
NGC CLI 3.10.0
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/c6786a1c-a733-499f-bb00-95667a557003/pretrained_detectnet_v2_vgooglenet': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/c6786a1c-a733-499f-bb00-95667a557003/pretrained_detectnet_v2_vgooglenet/googlenet.hdf5': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1/yolov4_tiny_ccpd_trainable.tlt': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/62f47074-2cf8-4931-b66f-582c4d72cf37/lpdnet_vunpruned_v2.1/yolov4_tiny_usa_trainable.tlt': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/7aa692c7-2d47-43b8-aa9a-3120993e1300/pretrained_object_detection_vdarknet53': Operation not permitted
chmod: changing permissions of '/shared/users/00000000-0000-0000-0000-000000000000/models/7aa692c7-2d47-43b8-aa9a-3120993e1300/pretrained_object_detection_vdarknet53/darknet_53.hdf5': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/logs': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/jobs.yaml': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/specs': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/specs/convert.json': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/jobs_metadata': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_2293_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1396_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_556_0_augmented_224296389824903_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_382_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_541_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1740_0_augmented_164210584358369_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_313_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1354_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_2144_0_augmented_251934180809220_.txt': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/19ef1578-12d4-4463-9b59-2432bd8bb666/labels/training_data_1473_0_augmented_251934180809220_.txt': Operation not permitted
...
Then, of course, when my ML team tries to make requests, they get 500 errors:
[2023-03-30 12:19:52,518] ERROR in app: Exception on /api/v1/user/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/model [GET]
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/app.py", line 1424, in model_list
    response = make_response(jsonify(schema.dump(schema.load(metadata))['models']))
                                                 ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/marshmallow/schema.py", line 722, in load
    return self._do_load(
           ^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/marshmallow/schema.py", line 909, in _do_load
    raise exc
marshmallow.exceptions.ValidationError: {'models': {6: {'train_datasets': ['Not a valid list.']}, 7: {'train_datasets': ['Not a valid list.']}, 8: {'train_datasets': ['Not a valid list.']}}}
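The ValidationError says that entries 6, 7 and 8 of 'models' carry a train_datasets value that is not a list. A minimal Python sketch for locating such entries in metadata that has already been loaded into a dict; the layout is inferred from the error message only, not from the TAO source, and the sample values are made up:

```python
def non_list_train_datasets(metadata):
    """Return {index: value} for models whose 'train_datasets' is present but not a list.

    This only approximates the check that produced 'Not a valid list.';
    it is not the marshmallow schema used by the API.
    """
    bad = {}
    for index, model in enumerate(metadata.get("models", [])):
        if "train_datasets" in model and not isinstance(model["train_datasets"], list):
            bad[index] = model["train_datasets"]
    return bad


# Made-up sample: index 0 is well-formed, indexes 1 and 2 are not.
sample = {
    "models": [
        {"train_datasets": ["dataset-a"]},
        {"train_datasets": "dataset-a"},
        {"train_datasets": {}},
    ]
}
print(non_list_train_datasets(sample))
```

Comparing the flagged entries with the same records from before the restart would show whether the stored metadata itself changed or only became unreadable.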