TAO 5.3 Direct Container usage issues

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) H100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi,

I've been using TAO 5.3 for some time via the Launcher. A specific use case required that I use the containers directly. I used the following command:

docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt segformer train -e /workspace/tao-experiments/specs/WindowsV2/MLOps/mit_b5/512/train.yaml -r /workspace/tao-experiments/results/WindowsV2/MLOps/mit_b5/51

I should highlight that I'm using the exact same spec file and dataset that run without issues via the launcher, so the problem does not lie there.

When I run the above command I get a very verbose error message (below) that points to an mmengine exception in local_backend.py. The error is:

[Errno 40] Too many levels of symbolic links: '/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/stderr'
Error executing job with overrides: ['results_dir=/workspace/tao-experiments/results/WindowsV2/MLOps/mit_b5/51']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 119, in main
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 106, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 85, in run_experiment
runner.train()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
self._train_loop = self.build_train_loop(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
loop = LOOPS.build(
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 219, in __init__
super().__init__(runner, dataloader)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
self.dataloader = runner.build_dataloader(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 142, in __init__
self.full_init()
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
self.data_list = self.load_data_list()
File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 256, in load_data_list
for img in fileio.list_dir_or_file(
File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/io.py", line 760, in list_dir_or_file
yield from backend.list_dir_or_file(dir_path, list_dir, list_file, suffix,
File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
yield from _list_dir_or_file(entry.path, list_dir,
File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
yield from _list_dir_or_file(entry.path, list_dir,
File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
yield from _list_dir_or_file(entry.path, list_dir,
[Previous line repeated 37 more times]
File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 528, in _list_dir_or_file
if not entry.name.startswith('.') and entry.is_file():
OSError: [Errno 40] Too many levels of symbolic links: '/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/stderr'
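For context, from what I can tell errno 40 is ELOOP, which the kernel raises when resolving a symlink chain exceeds its limit. A minimal reproduction (my own illustration, not TAO code):

import os
import tempfile

# A symlink that points at itself can never be resolved, so any call
# that follows it (os.stat, DirEntry.is_file(), ...) fails with ELOOP,
# which is errno 40 on Linux.
d = tempfile.mkdtemp()
loop = os.path.join(d, "loop")
os.symlink(loop, loop)

try:
    os.stat(loop)  # follows the link by default
except OSError as e:
    print(e.errno, e.strerror)  # -> 40 Too many levels of symbolic links

The ever-growing '/dev/fd/52/dev/fd/52/...' path in the message suggests the scan is following the /dev/fd file-descriptor links back into the directory being scanned.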

The function that raises the exception is as follows (I shelled into the container to inspect it):

def list_dir_or_file(self,
                     dir_path: Union[str, Path],
                     list_dir: bool = True,
                     list_file: bool = True,
                     suffix: Optional[Union[str, Tuple[str]]] = None,
                     recursive: bool = False) -> Iterator[str]:
    """Scan a directory to find the interested directories or files in
    arbitrary order.

    Note:
        :meth:`list_dir_or_file` returns the path relative to ``dir_path``.

    Args:
        dir_path (str or Path): Path of the directory.
        list_dir (bool): List the directories. Defaults to True.
        list_file (bool): List the path of files. Defaults to True.
        suffix (str or tuple[str], optional): File suffix that we are
            interested in. Defaults to None.
        recursive (bool): If set to True, recursively scan the directory.
            Defaults to False.

    Yields:
        Iterable[str]: A relative path to ``dir_path``.

    Examples:
        >>> backend = LocalBackend()
        >>> dir_path = '/path/of/dir'
        >>> # list those files and directories in current directory
        >>> for file_path in backend.list_dir_or_file(dir_path):
        ...     print(file_path)
        >>> # only list files
        >>> for file_path in backend.list_dir_or_file(dir_path, list_dir=False):
        ...     print(file_path)
        >>> # only list directories
        >>> for file_path in backend.list_dir_or_file(dir_path, list_file=False):
        ...     print(file_path)
        >>> # only list files ending with specified suffixes
        >>> for file_path in backend.list_dir_or_file(dir_path, suffix='.txt'):
        ...     print(file_path)
        >>> # list all files and directory recursively
        >>> for file_path in backend.list_dir_or_file(dir_path, recursive=True):
        ...     print(file_path)
    """  # noqa: E501
    if list_dir and suffix is not None:
        raise TypeError('`suffix` should be None when `list_dir` is True')

    if (suffix is not None) and not isinstance(suffix, (str, tuple)):
        raise TypeError('`suffix` must be a string or tuple of strings')

    root = dir_path

    def _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                          recursive):
        for entry in os.scandir(dir_path):
            if not entry.name.startswith('.') and entry.is_file():
                rel_path = osp.relpath(entry.path, root)
                if (suffix is None
                        or rel_path.endswith(suffix)) and list_file:
                    yield rel_path
            elif osp.isdir(entry.path):
                if list_dir:
                    rel_dir = osp.relpath(entry.path, root)
                    yield rel_dir
                if recursive:
                    yield from _list_dir_or_file(entry.path, list_dir,
                                                 list_file, suffix,
                                                 recursive)

    return _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                             recursive)

I then edited that function as follows. I believe there is infinite recursion going on: osp.isdir(entry.path) follows symbolic links, so the recursive scan can keep descending through looping links instead of terminating.

def list_dir_or_file(self,
                     dir_path: Union[str, Path],
                     list_dir: bool = True,
                     list_file: bool = True,
                     suffix: Optional[Union[str, Tuple[str]]] = None,
                     recursive: bool = False) -> Iterator[str]:
    """Scan a directory to find the interested directories or files in
    arbitrary order.

    Note:
        :meth:`list_dir_or_file` returns the path relative to ``dir_path``.

    Args:
        dir_path (str or Path): Path of the directory.
        list_dir (bool): List the directories. Defaults to True.
        list_file (bool): List the path of files. Defaults to True.
        suffix (str or tuple[str], optional): File suffix that we are
            interested in. Defaults to None.
        recursive (bool): If set to True, recursively scan the directory.
            Defaults to False.

    Yields:
        Iterable[str]: A relative path to ``dir_path``.

    Examples:
        >>> backend = LocalBackend()
        >>> dir_path = '/path/of/dir'
        >>> # list those files and directories in current directory
        >>> for file_path in backend.list_dir_or_file(dir_path):
        ...     print(file_path)
        >>> # only list files
        >>> for file_path in backend.list_dir_or_file(dir_path, list_dir=False):
        ...     print(file_path)
        >>> # only list directories
        >>> for file_path in backend.list_dir_or_file(dir_path, list_file=False):
        ...     print(file_path)
        >>> # only list files ending with specified suffixes
        >>> for file_path in backend.list_dir_or_file(dir_path, suffix='.txt'):
        ...     print(file_path)
        >>> # list all files and directory recursively
        >>> for file_path in backend.list_dir_or_file(dir_path, recursive=True):
        ...     print(file_path)
    """  # noqa: E501
    if list_dir and suffix is not None:
        raise TypeError('`suffix` should be None when `list_dir` is True')

    if (suffix is not None) and not isinstance(suffix, (str, tuple)):
        raise TypeError('`suffix` must be a string or tuple of strings')

    root = dir_path
    visited_paths = set()
    
    def _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                          recursive):
        for entry in os.scandir(dir_path):
            if entry.is_symlink():
                continue  # Skip symbolic links

            if entry.path in visited_paths:
                continue  # Skip already visited paths

            visited_paths.add(entry.path)

            if not entry.name.startswith('.') and entry.is_file():
                rel_path = osp.relpath(entry.path, root)
                if (suffix is None
                        or rel_path.endswith(suffix)) and list_file:
                    yield rel_path
            elif osp.isdir(entry.path): 
                if list_dir:
                    rel_dir = osp.relpath(entry.path, root)
                    yield rel_dir
                if recursive:
                    yield from _list_dir_or_file(entry.path, list_dir,
                                                 list_file, suffix,
                                                 recursive)
        # print('visited_paths:', visited_paths)
    return _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                             recursive)

I then did a docker commit to save those modifications. With this new container I ran again using the same command (docker run …) with the same arguments. The above error did not occur; however, I now receive multiple torch errors related to tensor shape:

UserWarning: Please pay attention your ground truth segmentation map, usually the segmentation map is 2D, but got (456, 512, 4)

Now, my dataset has not changed, and the launcher successfully trains this exact same dataset (all images are 512x512x3 PNGs; all masks are 512x512 PNGs). The training spec is identical except for pointing to the correct dataset locations (to account for running the docker directly).
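For what it's worth, a quick sanity check on a mask (a sketch; assuming PIL and numpy are available, and the path is illustrative) shows what shape it actually decodes to:

from PIL import Image
import numpy as np

# A SegFormer ground-truth mask should decode to a 2D (H, W) array of
# class indices; an (H, W, 4) result means the PNG is being read as RGBA.
mask = np.array(Image.open("/workspace/tao-experiments/data/masks/example.png"))
print(mask.shape, mask.dtype, np.unique(mask))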

I'm wondering whether the launcher does some other manipulation of the dataset before handing it to torch? Any ideas?

cheers

To narrow down, could you please try using a new results folder when you run with docker?
For example,

docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt segformer train -e /workspace/tao-experiments/specs/WindowsV2/MLOps/mit_b5/512/train.yaml -r /workspace/tao-experiments/new_results

Hi @Morganh

Thank you for the quick response. I tried the change as you suggested, but unfortunately I get the same error. It's quite strange how wrong the tensor shapes are (as reported by torch), since, as I said, the dataset is exactly the same one that runs fine from the launcher.

My changes to mmengine may be the problem here, but I would need to dig further into the TAO code to understand what local_backend.py is used for. So I was wondering whether this had been seen before, or whether there is any view on what TAO uses local_backend.py for. From my research, errno 40 can occur where symbolic links loop recursively.

Again, the two may be totally unrelated, so it would be useful to know whether direct invocation of the docker has been used successfully for a TAO 5.3 Segformer run.

Thank you for your help.

Cheers

I suggest not changing the mmengine code. It seems the recursive loop happens while building the dataloader.
For tao-launcher, you mentioned that it works; that is, when you run something like tao model segformer xxx, it works. For tao-launcher, the path mapping is defined in the ~/.tao_mounts.json file.
For running the container, the mapping is set by -v /home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments.
Can you check the .tao_mounts.json to compare?
Also, to narrow down, you can copy the dataset to another place and run again with the docker/container.
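For reference, a typical ~/.tao_mounts.json looks similar to this (the paths here are examples); each source/destination pair plays the same role as a -v flag:

{
    "Mounts": [
        {
            "source": "/home/ubuntu/DBI-9/Transfer/WindowService",
            "destination": "/workspace/tao-experiments"
        }
    ]
}

If the source or destination differs from what you pass to -v, the paths inside the spec file will resolve differently in the two cases.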

Hi @Morganh

I tried moving the dataset to a new location, adjusted the train.yaml configuration to suit, and used my modified mmengine code, but unfortunately the same tensor error occurs.

If I try the unmodified container, then the errno 40 error above occurs no matter what I do, as it happens earlier in the cycle.

For my use case I need to run the docker directly. I am very familiar with the ~/.tao_mounts.json file, but that is for a launcher setup.

As an alternative, is there a reliable way to run the launcher from code, rather than from a jupyter cell?

cheers

For clarity (regarding my last question), when I use:

os.system("tao model segformer …")

I get the warning:

[TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
the input device is not a TTY

I got around that problem before by launching the docker container without the "-t" flag, but I can't do that with the launcher as it handles the docker invocation itself.
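A possible workaround I have not fully tested is Python's pty module, which runs the child under a pseudo-terminal so the launcher's TTY check should pass (a sketch; paths are placeholders):

import pty

# pty.spawn runs the command under a pseudo-terminal, so the
# "docker run -it" issued by the launcher sees a real TTY even when
# this script itself is not attached to one.
status = pty.spawn([
    "tao", "model", "segformer", "train",
    "-e", "/path/to/train.yaml",
    "-r", "/path/to/results",
])
print("exit status:", status)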

Since this is a new error that appeared after you modified the mmengine code, as mentioned previously, I suggest you change the code back. Instead of a jupyter cell, I suggest you open a terminal to run.

Steps:

  1. Open a terminal.
  2. $ cat ~/.tao_mounts.json
  3. $ tao model segformer train xxx (This is the case for running tao-launcher)
    Please check whether this runs successfully, i.e., that you can reproduce the working tao-launcher case you mentioned.
  4. If step 3 works, then you can move to the container case.
    $ docker run xxx (This is the case for running the container)

Hi @Morganh

I've actually tried all of that, which is what led to my original post. I have, however, repeated the steps as per your request and can report as follows:

Step 3 works; that is, running via the launcher consumes the dataset and I get reasonable results.
Step 4 fails with:

"OSError: [Errno 40] Too many levels of symbolic links: '/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/stderr'"

This error is what led me to investigate the root cause in mmengine and patch it, after which the tensor shape issue arose.

As per my question yesterday, and as a temporary WAR (workaround), how can I run the segformer model from code?

I have looked at NVIDIA's tao_pytorch_backend repo, but it is now unsupported and the instructions appear confused: the repo discusses how to configure tao_pt to launch an interactive docker instance, whereas the website talks about invoking the tao task by passing the model.py file (such as segformer.py) to a python interpreter. These two approaches are incompatible.
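In the meantime, the closest I have to running from code is driving the container itself from Python, dropping the -t flag as before (a sketch mirroring my original command):

import subprocess

# Same invocation as at the top of this thread, but interactive-only
# (-i, no -t), so it runs cleanly from a non-interactive process.
subprocess.run([
    "docker", "run", "--rm", "-i", "--ipc=host",
    "--ulimit", "memlock=-1", "--ulimit", "stack=67108864",
    "--gpus", "all",
    "-v", "/home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments",
    "nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt",
    "segformer", "train",
    "-e", "/workspace/tao-experiments/specs/WindowsV2/MLOps/mit_b5/512/train.yaml",
    "-r", "/workspace/tao-experiments/results/WindowsV2/MLOps/mit_b5/51",
], check=True)

but of course this currently runs into the errno 40 issue above.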

Grateful for any guidance you can provide regarding my WAR question above. Thank you.

It is unexpected to see this kind of error in step 4, because we often run both cases (step 3 and step 4).
I suggest you narrow down step 4 by:

  • changing the mapping, i.e., changing -v /home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments to something else,
    for example, -v /home/ubuntu/DBI-9/Transfer/WindowService:/home/tao/
  • changing to another dataset, for example, a dataset which is confirmed to have worked previously.

BTW, please check whether there are symbolic links in your dataset, as in the sketch below.
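For example, something like this (run against your mount source; the path is taken from your command) would list any symlinks in the dataset tree:

import os

# Walk the dataset directory and report every symbolic link,
# since a looping link would explain the ELOOP/errno 40 failure.
# os.walk does not follow directory symlinks by default, so this
# scan itself cannot loop.
root = "/home/ubuntu/DBI-9/Transfer/WindowService"
for dirpath, dirnames, filenames in os.walk(root):
    for name in dirnames + filenames:
        path = os.path.join(dirpath, name)
        if os.path.islink(path):
            print(path, "->", os.readlink(path))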

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
