TAO training fails to create a directory when running a standard example

Please provide the following information when requesting support.
Hardware: T4
Network type: Classification (standard example): tao_voc/classification.ipynb

Perhaps you mean: tao info --verbose. The result is:

tao info --verbose
Configuration of the TAO Toolkit Instance

dockers:
    nvidia/tao/tao-toolkit-tf:
        v3.22.05-tf1.15.5-py3:
            docker_registry: nvcr.io
            tasks:
                1. augment
                2. bpnet
                3. classification
                4. dssd
                5. faster_rcnn
                6. emotionnet
                7. efficientdet
                8. fpenet
                9. gazenet
                10. gesturenet
                11. heartratenet
                12. lprnet
                13. mask_rcnn
                14. multitask_classification
                15. retinanet
                16. ssd
                17. unet
                18. yolo_v3
                19. yolo_v4
                20. yolo_v4_tiny
                21. converter
        v3.22.05-tf1.15.4-py3:
            docker_registry: nvcr.io
            tasks:
                1. detectnet_v2
    nvidia/tao/tao-toolkit-pyt:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. speech_to_text
                2. speech_to_text_citrinet
                3. speech_to_text_conformer
                4. action_recognition
                5. pointpillars
                6. pose_classification
                7. spectro_gen
                8. vocoder
        v3.21.11-py3:
            docker_registry: nvcr.io
            tasks:
                1. text_classification
                2. question_answering
                3. token_classification
                4. intent_slot_classification
                5. punctuation_and_capitalization
    nvidia/tao/tao-toolkit-lm:
        v3.22.05-py3:
            docker_registry: nvcr.io
            tasks:
                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

Standard example:
model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  pretrained_model_path: "/workspace/tao-experiments/classification/pretrained_resnet18/pretrained_classification_vresnet18/resnet_18.hdf5"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 64
  n_epochs: 80
  n_workers: 16
  preprocess_mode: "caffe"
  enable_random_crop: True
  enable_center_crop: True
  label_smoothing: 0.0
  mixup_alpha: 0.1
  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
  # learning_rate
  lr_config {
    step {
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }
  }
}
eval_config {
  eval_dataset_path: "/workspace/tao-experiments/data/split/test"
  model_path: "/workspace/tao-experiments/classification/output/weights/resnet_080.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
  enable_center_crop: True
}

Command: !tao classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY
2022-09-05 14:02:57,197 [INFO] root: Registry: ['nvcr.io']
2022-09-05 14:02:57,372 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-x8l3qzai because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2022-09-05 21:03:04,898 [INFO] __main__: Loading experiment spec at /data/virt/cv_samples_v1.4.1/classification/tao_voc/specs/classification_spec.cfg.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:384: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2022-09-05 21:03:04,908 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:384: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:393: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2022-09-05 21:03:04,908 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:393: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 653, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 649, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 635, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 406, in run_experiment
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/user'
2022-09-05 14:03:06,526 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
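
The traceback ends in os.makedirs, which creates missing parent directories from the top down. The path named in the PermissionError is therefore the first ancestor that does not exist and could not be created, not necessarily the directory passed with -r. A minimal sketch of that behaviour follows; the helper name first_missing_ancestor is hypothetical and not part of TAO:

```python
import os


def first_missing_ancestor(path):
    """Return the top-most component of `path` that does not exist yet.

    os.makedirs() has to create this directory first, so it is the path
    that appears in a PermissionError when its parent is not writable.
    Returns None if `path` already exists.
    """
    path = os.path.abspath(path)
    missing = None
    # Walk upward until an existing directory is found.
    while not os.path.exists(path):
        missing = path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return missing
```

Run inside the container, this would be expected to return /home/user for any results directory below it, provided /home exists and /home/user does not, which would be consistent with the error above.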

.tao_mounts.json:
{
    "Mounts": [
        {
            "source": "/home/user/virt/tr1",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs",
            "destination": "/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
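
Only paths under a "destination" in this file are backed by a host directory; anything else the container writes goes to its own filesystem. A small sketch for checking whether a given container path is covered by one of the two mounts listed above. The helper name covering_mount is hypothetical; this is an illustration, not something the TAO launcher provides:

```python
import posixpath

# The two mounts from the file above.
MOUNTS = [
    {"source": "/home/user/virt/tr1",
     "destination": "/workspace/tao-experiments"},
    {"source": "/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs",
     "destination": "/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs"},
]


def covering_mount(container_path, mounts=MOUNTS):
    """Return the mount whose destination contains `container_path`, or None."""
    path = posixpath.normpath(container_path)
    for mount in mounts:
        dest = posixpath.normpath(mount["destination"])
        # Match the destination itself or anything strictly below it.
        if path == dest or path.startswith(dest + "/"):
            return mount
    return None


print(covering_mount("/workspace/tao-experiments/classification/output"))
print(covering_mount("/home/user/virt/tr1/output"))
```

According to the environment dump posted later in this thread, USER_EXPERIMENT_DIR is /home/user/virt/tr1, so -r $USER_EXPERIMENT_DIR/output expands to a host path that is not a mount destination. That may be why os.makedirs tried to create /home/user inside the container.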

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

(1) docker can run without sudo
(2) "docker login nvcr.io" done
(3) ngc works

Can you run the commands below in a terminal instead of the notebook to check whether you can mkdir a folder?
$ tao classification run /bin/bash
# cd /workspace/tao-experiments
# mkdir testfolder

I just ran the three steps above, and the folder was created:

(ntao) user@station:~$ tao classification run /bin/bash
2022-09-05 20:38:38,541 [INFO] root: Registry: ['nvcr.io']
2022-09-05 20:38:38,700 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
groups: cannot find name for group ID 1000
I have no name!@f72ed3915234:/workspace$ cd /workspace/tao-experiments/
I have no name!@f72ed3915234:/workspace/tao-experiments$ ls
classification data ngccli results workspace
I have no name!@f72ed3915234:/workspace/tao-experiments$ mkdir testfolder
I have no name!@f72ed3915234:/workspace/tao-experiments$ ls
classification data ngccli results testfolder workspace
I have no name!@f72ed3915234:/workspace/tao-experiments$
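
The "I have no name!" prompt and the "groups: cannot find name for group ID 1000" message are what the shell prints when the numeric IDs it runs under (here 1000:1000, from DockerOptions) have no entry in the container's /etc/passwd and /etc/group. It did not prevent mkdir from succeeding in the mounted directory here. A minimal sketch of the same lookup in Python, with a hypothetical helper user_name_or_none:

```python
import os
import pwd


def user_name_or_none(uid, lookup=pwd.getpwuid):
    """Return the login name for `uid`, or None if it has no passwd entry."""
    try:
        return lookup(uid).pw_name
    except KeyError:
        # No passwd entry: the situation bash reports as "I have no name!".
        return None
```

Calling user_name_or_none(os.getuid()) inside the container would be expected to return None when the container runs as 1000:1000 and the image defines no such user.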

To narrow this down, can you run the training inside the docker container again?
# classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY

Just ran it. Same issue. Below are the logs, with the environment posted first:
environ({‘LD_LIBRARY_PATH’: ‘/usr/local/cuda/lib64:/home/user/TensorRT-7.2.3.4/lib:/usr/local/cuda/extras/CUPTI/lib64/:/home/user/TensorRT-8.0.1.6/lib:/opt/nvidia/deepstream/deepstream-5.1/lib/’, ‘SSH_CONNECTION’: ‘192.168.86.32 55794 192.168.86.25 22’, ‘LANG’: ‘en_US.UTF-8’, ‘OLDPWD’: ‘/data/virt’, ‘VIRTUAL_ENV’: ‘/data/virt/ntao’, ‘S_COLORS’: ‘auto’, ‘XDG_SESSION_ID’: ‘4715’, ‘USER’: ‘user’, ‘PWD’: ‘/data/virt/cv_samples_v1.4.1’, ‘HOME’: ‘/home/user’, ‘SSH_CLIENT’: ‘192.168.86.32 55794 22’, ‘CUDA_HOME’: ‘/usr/local/cuda’, ‘XDG_DATA_DIRS’: ‘/usr/local/share:/usr/share:/var/lib/snapd/desktop’, ‘SSH_TTY’: ‘/dev/pts/3’, ‘MAIL’: ‘/var/mail/user’, ‘TERM’: ‘xterm-color’, ‘SHELL’: ‘/bin/bash’, ‘SHLVL’: ‘1’, ‘GST_DEBUG_DUMP_DOT_DIR’: ‘/tmp’, ‘LOGNAME’: ‘user’, ‘DBUS_SESSION_BUS_ADDRESS’: ‘unix:path=/run/user/1000/bus’, ‘XDG_RUNTIME_DIR’: ‘/run/user/1000’, ‘PATH’: ‘/data/virt/ntao/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin:/snap/bin:/home/user/TensorRT-8.0.1.6/include:/home/user/TensorRT-7.2.3.4/include:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/user/virt/ngc-cli’, ‘PS1’: '(ntao) ${debian_chroot:+($debian_chroot)}\u@\h:\w\$ ', ‘_’: ‘/data/virt/ntao/bin/jupyter-notebook’, ‘JPY_PARENT_PID’: ‘35336’, ‘CLICOLOR’: ‘1’, ‘PAGER’: ‘cat’, ‘GIT_PAGER’: ‘cat’, ‘MPLBACKEND’: ‘module://ipykernel.pylab.backend_inline’, ‘KEY’: ‘OHExcnRlMWNyMHNrcGE2MmlzMWlndmpoNXE6OGM0MTYyZTctNGVjNy00MDhjLTg5YmYtNDE4MTNmZTczOGUz’, ‘NUM_GPUS’: ‘1’, ‘USER_EXPERIMENT_DIR’: ‘/home/user/virt/tr1’, ‘DATA_DOWNLOAD_DIR’: ‘/home/user/virt/tr1/data’, ‘LOCAL_PROJECT_DIR’: ‘/home/user/virt/tr1’, ‘LOCAL_DATA_DIR’: ‘/home/user/virt/tr1/data’, ‘LOCAL_EXPERIMENT_DIR’: ‘/home/user/virt/tr1/classification’, ‘LOCAL_SPECS_DIR’: ‘/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs’, ‘SPECS_DIR’: ‘/data/virt/cv_samples_v1.4.1/classification/tao_voc/specs’})
2022-09-05 22:29:30,471 [INFO] root: Registry: ['nvcr.io']
2022-09-05 22:29:30,619 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-v_43hf20 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2022-09-06 05:29:38,283 [INFO] __main__: Loading experiment spec at /data/virt/cv_samples_v1.4.1/classification/tao_voc/specs/classification_spec.cfg.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:384: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2022-09-06 05:29:38,294 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:384: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:393: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2022-09-06 05:29:38,294 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py:393: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 653, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 649, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 635, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 406, in run_experiment
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/user'
2022-09-05 22:29:39,896 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you share the explicit command as well?

Please try removing the following from ~/.tao_mounts.json to check if it works.

    "DockerOptions": {
        "user": "1000:1000"
    }

Reference: Permission Denied Error When training MASK RCNN - #12 by subhankar.halder
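
The same change can also be made programmatically. A minimal sketch, assuming the file lives at ~/.tao_mounts.json and has the structure shown earlier in this thread; the helper names drop_docker_user and rewrite_mounts_file are hypothetical:

```python
import json
from pathlib import Path


def drop_docker_user(config):
    """Return a copy of a mounts config without the DockerOptions "user" entry."""
    cleaned = dict(config)
    options = dict(cleaned.get("DockerOptions", {}))
    options.pop("user", None)
    if options:
        cleaned["DockerOptions"] = options
    else:
        # Nothing else was configured, so drop the empty section entirely.
        cleaned.pop("DockerOptions", None)
    return cleaned


def rewrite_mounts_file(path=Path.home() / ".tao_mounts.json"):
    """Rewrite the mounts file in place, keeping a .bak copy of the original."""
    original = path.read_text()
    path.with_name(path.name + ".bak").write_text(original)
    cleaned = drop_docker_user(json.loads(original))
    path.write_text(json.dumps(cleaned, indent=4))
```

Without the "user" option the container presumably runs as its default user (typically root), so files it creates under the mounted directories may end up owned by root on the host.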

Yes, this is working now. Thank you very much.

I made the changes you suggested.
