Unet_isbi notebook fails at the train instruction

After customizing the notebook for my local computer, it stops at the start of section 4, "Run TAO training".

The train command is

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
    -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
    -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
    -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
    -n model_isbi \
    -k $KEY

The error is:

For multi-GPU, change --gpus based on your machine.
2022-01-05 00:36:47,003 [INFO] root: Registry: ['nvcr.io']
2022-01-05 00:36:47,115 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-01-05 00:36:47,146 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/david/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:410: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

Loading experiment spec at /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt.
2022-01-04 22:36:54,113 [INFO] __main__: Loading experiment spec at /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt.
2022-01-04 22:36:54,115 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt
2022-01-04 22:36:54,117 [INFO] root: Initializing the pre-trained weights from /home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 424, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 418, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 309, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/model/utilities.py", line 298, in get_pretrained_ckpt
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = '/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2022-01-05 00:36:55,132 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The file '/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5' does exist on my machine.

The environment variables are

env: KEY=nvidia_tlt
env: GPU_INDEX=0
env: USER_EXPERIMENT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/unet
env: DATA_DOWNLOAD_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/AAData/data
env: NOTEBOOK_ROOT=/home/david/Envs/env1/cv_samples_v1.3.0/unet
env: LOCAL_PROJECT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/
env: PROJECT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/deps
env: SPECS_DIR=/workspace/tao-experiments/unet/specs
total 4
drwxrwxrwx 4 david david 4096 Jan  4 13:57 isbi

The tao_mounts.json file is unchanged from what the notebook generated earlier with:

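import json
import os

drive_map = {
    "Mounts": [
        # Mapping the data directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
        # Mapping the specs directory.
        {
            "source": os.environ["LOCAL_SPECS_DIR"],
            "destination": os.environ["SPECS_DIR"]
        },
    ]
}

# Write the mapping to the mounts file that the TAO launcher reads.
mounts_file = os.path.expanduser("~/.tao_mounts.json")
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)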

resulting in

{
    "Mounts": [
        {
            "source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/david/Envs/env1/cv_samples_v1.3.0/unet/specs",
            "destination": "/workspace/tao-experiments/unet/specs"
        }
    ]
}

Notice how "destination": "/workspace/tao-experiments" is hardcoded.

I could not really understand what this is for from the documentation. From this doc page… Section 2…

"… The Mounts parameter defines the paths in the local machine, that should be mapped to the docker. This is a list of json dictionaries containing the source path in the local machine and the destination path that is mapped for the TAO Toolkit commands."

What exactly is the destination path? Is there any documentation on which docker folders need mapping?

But that is a side point. The real question is what is going on here.

I granted all permissions by running sudo chmod -R 777 . in the TAO examples root directory, with the same result.

I added

"DockerOptions": {
    "user": "1000:1000"
}

to tao_mounts.json, and that gave a different error:

PermissionError: [Errno 13] Permission denied: '/home/david/Envs/env1/cv_samples_v1.3.0/unet'
2022-01-04 22:59:40,485 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I also ran the unet train command from the terminal, with the exact same result.

Any guidance on this problem would be much appreciated, as would some clarification and details on the mounts.

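Please check the path above (the pretrained model path from the error) and the paths in your training spec file. Every path the training command sees should be a path inside the docker, not a path on the host.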

See your tao_mounts.json,

        "source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
        "destination": "/workspace/tao-experiments"

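You already map your local "source" to the "destination" folder of the docker, so /home/david/Envs/env1/cv_samples_v1.3.0/unet on the host is visible inside the docker as /workspace/tao-experiments/unet.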
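
To make the mapping concrete, here is a minimal Python sketch that translates a host path into the path the container would see, using the "Mounts" entries from the tao_mounts.json shown above. The helper name host_to_container is hypothetical and not part of TAO; it only assumes that each mount exposes "source" at "destination" and that the most specific (longest) matching source wins.

```python
import posixpath


def host_to_container(host_path, mounts):
    """Translate a host path to its in-container path (hypothetical helper).

    Returns None when no "Mounts" entry covers the host path.
    """
    host_path = posixpath.normpath(host_path)
    best = None
    for mount in mounts:
        source = posixpath.normpath(mount["source"])
        destination = posixpath.normpath(mount["destination"])
        if host_path == source or host_path.startswith(source + "/"):
            # Prefer the longest, i.e. most specific, matching source.
            if best is None or len(source) > len(best[0]):
                best = (source, destination)
    if best is None:
        return None
    source, destination = best
    relative = posixpath.relpath(host_path, source)
    if relative == ".":
        return destination
    return posixpath.join(destination, relative)


# The two mounts from the tao_mounts.json in the question.
mounts = [
    {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
     "destination": "/workspace/tao-experiments"},
    {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/unet/specs",
     "destination": "/workspace/tao-experiments/unet/specs"},
]

host_model = ("/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/"
              "pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5")
container_model = host_to_container(host_model, mounts)
print(container_model)
```

For the model file in the error, this prints /workspace/tao-experiments/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5. That suggests pointing USER_EXPERIMENT_DIR at /workspace/tao-experiments/unet, so that the -r and -m arguments resolve inside the docker, while LOCAL_PROJECT_DIR keeps the host path.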
