TAO Toolkit API crashes because of permissions

Hello all,

We are running the TAO Toolkit API on an EKS cluster with T4 GPUs. The EKS cluster was created and deployed following the instructions in the Setup guide.

Periodically the “tao-toolkit-api-app-pod” goes into a crash loop and emits a large number of permission errors like these:

chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/2_20190607_22-16-50-20190607_22-22-23_2640.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/0_20190522_18-11-10-20190522_18-15-18_14175.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/videos_192.168.60.4_4_21_0.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/ptzimage-10.169.222.11-2018-12-05t16-47-42.753237z_1.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/data_192.168.70.5_1_260_0.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/20190522_17-11-10-20190522_17-15-15_14525_1.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/ptzimage-10.169.240.161-2018-11-19t15-20-40.699514z_1.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/videos_192.168.60.4_3_1911_0.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/vids_192.168.70.5_2_130_0.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/2251-2019-12-20T08:05:07.872774102Z_0.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/0_20190522_17-31-10-20190522_17-35-12_11300.jpg': Operation not permitted
chmod: changing permissions of '/shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9/datasets/d42d0632-c8bf-4d81-9fd4-12e7c199ce0b/images/ptzimage-10.169.222.191-2018-11-28t01-52-55.586978z_0.jpg': Operation not permitted

Please try:
sudo chown -R root:root /shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9
sudo chmod -R 777 /shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9

Please run the above commands locally (on the host), not inside the container.
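
Before changing ownership, it may help to confirm which files are affected. chmod reports "Operation not permitted" when the calling process is neither the owner of the file nor privileged, so listing the entries that a given UID does not own shows where the errors come from. The helper below is only a sketch: the function name list_foreign_owned is hypothetical, and the UID the pod process actually runs as is an assumption you would need to check yourself.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAO Toolkit): print every entry under a
# directory that is NOT owned by the given numeric UID.
# If no UID is passed, the UID of the current user is used.
list_foreign_owned() {
    dir="$1"
    uid="${2:-$(id -u)}"
    # find treats a numeric -user argument as a UID when no user has that name.
    find "$dir" ! -user "$uid" -print
}
```

For example, list_foreign_owned /shared/users/34c37fbc-98e3-5cf6-8668-d5e2b1a5bca9 1000 would print the files that UID 1000 does not own (1000 is a placeholder, not a value taken from this thread). An empty result means ownership is probably not the cause.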

@Morganh Thanks for your response, but that will not fix my issue. When the pod starts up again after a failure, it attempts to change the permissions by itself.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Can you share more logs by running the following commands?
$ kubectl get pods
$ kubectl get pod -n nvidia-gpu-operator
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
$ kubectl get pod -n gpu-operator-operator nvidia-cuda-validator-****

Note: **** stands for the suffix of the actual pod name.
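
Since the API pod itself is the one that is crash-looping, its own events and the logs of the previous (crashed) container instance may also be useful. The following is a minimal sketch, assuming the pod name and namespace are taken from the output of kubectl get pods; the function name collect_pod_diagnostics and the KUBECTL override variable are hypothetical and only there so the commands can be previewed.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAO Toolkit): gather diagnostics for one pod.
# KUBECTL can be overridden, e.g. KUBECTL=echo, to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

collect_pod_diagnostics() {
    ns="$1"    # namespace the pod runs in
    pod="$2"   # full pod name as shown by `kubectl get pods`

    # Pod events, restart count, volumes and security context.
    $KUBECTL describe pod "$pod" -n "$ns"
    # Logs of the current container instance.
    $KUBECTL logs "$pod" -n "$ns"
    # Logs of the previous container instance, i.e. the one that crashed.
    $KUBECTL logs "$pod" -n "$ns" --previous
}
```

Usage could look like collect_pod_diagnostics <namespace> <pod-name> > diagnostics.txt 2>&1, with both placeholders replaced by the actual values from your cluster.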
