After customizing the notebook for my local machine, it stops at the start of section 4, "Run TAO training".
The train command is:
!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
                -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
                -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
                -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
                -n model_isbi \
                -k $KEY
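For clarity, this is how the -m argument expands on the host, given my environment variables (a quick illustration of the shell expansion, not TAO code):

```python
# The -m argument after shell expansion on the host; USER_EXPERIMENT_DIR
# is taken from the %env cells listed further down.
user_experiment_dir = "/home/david/Envs/env1/cv_samples_v1.3.0/unet"
m_arg = (user_experiment_dir
         + "/pretrained_resnet18"
         + "/pretrained_semantic_segmentation_vresnet18"
         + "/resnet_18.hdf5")
print(m_arg)
```

which is exactly the path that shows up in the OSError below.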
The error is:
For multi-GPU, change --gpus based on your machine.
2022-01-05 00:36:47,003 [INFO] root: Registry: ['nvcr.io']
2022-01-05 00:36:47,115 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-01-05 00:36:47,146 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/david/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:410: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
Loading experiment spec at /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt.
2022-01-04 22:36:54,113 [INFO] __main__: Loading experiment spec at /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt.
2022-01-04 22:36:54,115 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /workspace/tao-experiments/unet/specs/unet_train_resnet_unet_isbi.txt
2022-01-04 22:36:54,117 [INFO] root: Initializing the pre-trained weights from /home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 424, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 418, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 309, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/model/utilities.py", line 298, in get_pretrained_ckpt
File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
f = h5dict(filepath, 'r')
File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in __init__
self.data = h5py.File(path, mode=mode)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = '/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2022-01-05 00:36:55,132 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
The file ‘/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5’ does exist on the host.
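I double-checked that with a small host-side script (my own helper; the path is copied verbatim from the error message):

```python
import os

def check(path):
    """Print and return whether a path exists on the machine this runs on."""
    exists = os.path.exists(path)
    print(("exists: " if exists else "MISSING: ") + path)
    return exists

# Host-side path copied verbatim from the OSError above
check("/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18"
      "/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5")
```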
The environment variables are
env: KEY=nvidia_tlt
env: GPU_INDEX=0
env: USER_EXPERIMENT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/unet
env: DATA_DOWNLOAD_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/AAData/data
env: NOTEBOOK_ROOT=/home/david/Envs/env1/cv_samples_v1.3.0/unet
env: LOCAL_PROJECT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/
env: PROJECT_DIR=/home/david/Envs/env1/cv_samples_v1.3.0/deps
env: SPECS_DIR=/workspace/tao-experiments/unet/specs
Listing the data directory shows:
total 4
drwxrwxrwx 4 david david 4096 Jan 4 13:57 isbi
The .tao_mounts.json file is as originally generated earlier in the notebook by:
drive_map = {
"Mounts": [
# Mapping the data directory
{
"source": os.environ["LOCAL_PROJECT_DIR"],
"destination": "/workspace/tao-experiments"
},
# Mapping the specs directory.
{
"source": os.environ["LOCAL_SPECS_DIR"],
"destination": os.environ["SPECS_DIR"]
},
]
}
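As far as I can tell, the notebook then simply serializes this dict to ~/.tao_mounts.json with json.dump, along these lines (my reconstruction of the notebook cell, writing to a temp file so it is safe to run anywhere):

```python
import json
import os
import tempfile

drive_map = {
    "Mounts": [
        {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
         "destination": "/workspace/tao-experiments"},
        {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/unet/specs",
         "destination": "/workspace/tao-experiments/unet/specs"},
    ]
}

# The real notebook writes to ~/.tao_mounts.json; a temp file is used
# here so this sketch does not clobber an existing config.
mounts_file = os.path.join(tempfile.mkdtemp(), ".tao_mounts.json")
with open(mounts_file, "w") as f:
    json.dump(drive_map, f, indent=4)

print(open(mounts_file).read())
```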
resulting in
{
"Mounts": [
{
"source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
"destination": "/workspace/tao-experiments"
},
{
"source": "/home/david/Envs/env1/cv_samples_v1.3.0/unet/specs",
"destination": "/workspace/tao-experiments/unet/specs"
}
]
}
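If I understand the Mounts mechanism correctly, each source path on the host becomes visible inside the container at the corresponding destination path, something like this sketch (my own guess at the mapping, not TAO code):

```python
def to_container_path(host_path, mounts):
    """Map a host path to where it should appear inside the container,
    using the first mount whose source is a prefix of the path."""
    for m in mounts:
        src = m["source"].rstrip("/")
        if host_path == src or host_path.startswith(src + "/"):
            return m["destination"].rstrip("/") + host_path[len(src):]
    return None  # not under any mount: invisible inside the container

mounts = [
    {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/",
     "destination": "/workspace/tao-experiments"},
    {"source": "/home/david/Envs/env1/cv_samples_v1.3.0/unet/specs",
     "destination": "/workspace/tao-experiments/unet/specs"},
]

print(to_container_path(
    "/home/david/Envs/env1/cv_samples_v1.3.0/unet/pretrained_resnet18"
    "/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5",
    mounts))
```

By that logic the pretrained model should be reachable inside the container under /workspace/tao-experiments/unet/..., yet the traceback shows the container trying to open the raw /home/david/... host path, which is part of what confuses me.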
Notice how "destination": "/workspace/tao-experiments" is HARDCODED.
I could not really understand what this is for from the documentation. From this doc page… Section 2:

"… The Mounts parameter defines the paths in the local machine, that should be mapped to the docker. This is a list of json dictionaries containing the source path in the local machine and the destination path that is mapped for the TAO Toolkit commands."

What is the destination path, exactly? Is there any documentation on the docker folders that need mapping?
But that is a side point. The real question is what is going on here.
I gave all permissions using sudo chmod -R 777 . at the TAO examples root directory, with the same result.
I added
"DockerOptions": {
"user": "1000:1000"
}
to .tao_mounts.json, which gave a different error:
PermissionError: [Errno 13] Permission denied: '/home/david/Envs/env1/cv_samples_v1.3.0/unet'
2022-01-04 22:59:40,485 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
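For completeness, the directory named in the PermissionError looks wide open on the host after the chmod (a quick host-side stat using my own helper, not TAO code):

```python
import os
import stat

def describe(path):
    """Print the mode string and ownership of a path; return the stat result."""
    try:
        st = os.stat(path)
    except OSError as e:
        print(f"{path}: {e}")
        return None
    print(f"{path}: mode={stat.filemode(st.st_mode)} "
          f"uid={st.st_uid} gid={st.st_gid}")
    return st

describe("/home/david/Envs/env1/cv_samples_v1.3.0/unet")
```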
I also ran the unet train command directly from the terminal, with the exact same result.
Any guidance on this problem is much appreciated, as well as some clarification and details on the mounts.