TAO Toolkit Version 5.3 - Segformer ValueError: need at least one array to concatenate

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.3
• Training spec file (If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi

I created a new Python env and installed the TAO 5.3 Launcher. I ran exactly the same model as with 5.2 (same mounts, same dataset and same spec file), but it fails.

Looking at the errors and tracing them through the mmseg and NVIDIA TAO sources on GitHub, the run fails because a call to NumPy's concatenate fails, which usually means the list passed to it is empty. The failure appears to happen where the datasets are loaded, but I have not changed any datasets or the spec file. As a check, I repointed my notebook to the TAO 5.2 kernel and reran without any errors.
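
To illustrate what I mean (a minimal standalone sketch, not the actual TAO/mmengine code): mmengine's _serialize_data concatenates one serialized buffer per dataset sample, so if no image/mask pairs are picked up the list is empty and NumPy raises exactly this error.

    import numpy as np

    # Stand-in for mmengine's BaseDataset._serialize_data: it builds one
    # byte buffer per sample and concatenates them at the end.
    data_list = []  # pretend the dataset resolved zero image/mask pairs

    try:
        np.concatenate(data_list)
    except ValueError as err:
        print(err)  # prints: need at least one array to concatenate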

I know 5.3 is new, so I'm wondering whether this has come up yet?

Cheers

Error below.

/usr/local/lib/python3.10/dist-packages/mmseg/engine/hooks/visualization_hook.py:60: UserWarning: The draw is False, it means that the hook for visualization will not take effect. The results will NOT be visualized or stored.
  warnings.warn('The draw is False, it means that the '
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/dataloader/loading.py:53: UserWarning: reduce_zero_label will be deprecated, if you would like to ignore the zero label, please set reduce_zero_label=True when dataset initialized
  warnings.warn('reduce_zero_label will be deprecated, '
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/workspace/tao-experiments/results/Ex4']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 123, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 119, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 106, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 85, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 219, in __init__
    super().__init__(runner, dataloader)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 142, in __init__
    self.full_init()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 307, in full_init
    self.data_bytes, self.data_address = self._serialize_data()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 768, in _serialize_data
    data_bytes = np.concatenate(data_list)
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate
[The second GPU worker prints the same reduce_zero_label warning and an identical traceback, again ending in "ValueError: need at least one array to concatenate".]
[2024-04-03 22:41:25,154] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 404) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

The numpy version was updated in 5.3; I am not sure whether that is the cause. I will check further.
Could you try docker run --runtime=nvidia -it -v localfolder:dockerfolder --rm nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt /bin/bash to check whether it works?

Hi @Morganh
Thank you for the very quick response.

I tried the docker suggestion, but I had to remove the --runtime=nvidia flag for the container to run at all, and the container then reported that the GPUs were not found. Since I have the NVIDIA Container Toolkit installed, I tried again with the --gpus all flag and the container found the GPUs.

I then tried the following once in the container and in the right mapped folder:

segformer -r ./results -e ./specs/train.yaml -gpus 2 train

and I get the same numpy error as before.

Hope that helps. Thank you for all your help with this.

Cheers

This is not expected. Did you install nvidia-docker2 ?
$ sudo apt install nvidia-container-toolkit
$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd

Hi @Morganh

I confirm that I had not installed nvidia-docker2, as I assumed it was superseded by the Container Toolkit. However, I went ahead and installed it anyway and the error is still present.

Please let me know what you find internally. Thank you.

cheers

Hi @IainA
Actually, I cannot reproduce this error.
I ran it in two dockers:
nvcr.io/nvidia/tao/tao-toolkit:5.2.0.1-pyt1.14.0
nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt

Attach my logs for your reference.
20240406_forum_288481_tao_5.2.txt (11.9 KB)
20240406_forum_288481_spec.txt (1.6 KB)
20240406_forum_288481_tao_5.3.txt (41.8 KB)

Could you set all the paths to absolute paths?
For example, in the command line, change ./specs/train.yaml to /home/xxx/specs/train.yaml.

Please also change all the dataset paths inside the training spec file.

Hi @Morganh

Thank you for the update. I notice that your spec file has max_iters at 220 AND validation_interval at 220 as well. I suspect this will not fail because the validation interval is not less than max_iters. My issue occurs when validation_interval is less than max_iters.

Just to make sure: as I understand it, if validation_interval is equal to or above max_iters, then the validation dataset is not used by the optimizer to inform any backprop or other transformer operations, even though validate is set to true? This is a key point for me, so please confirm.

If that is the case, then when validation is run (either via an explicit TAO evaluation command or at the end of the training cycle, per above) the validation set is essentially a pure, unseen "test" set. It's interesting that I get reasonable results when I set validation_interval above max_iters.

In the meantime, I will turn my attention to the 5.3 issue using the input you provided above. Thanks again.

cheers

Hi, @IainA
I ran with validation_interval: 20 and max_iters: 220. The training is also fine in TAO 5.3. See the training log below (spec file inside).
20240410_forum_288481_tao_5.3.txt (82.7 KB)
BTW, I added some debug logging in /usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py.

The validation_interval is the number of training iterations between validation runs. It is not related to transformer operations.

Yes, the above is not related to this topic. For the ValueError issue you mentioned in this topic, did you ever try the ISBI dataset? As far as I know, TAO 5.3 has a newer version of numpy, but as shared above, I still cannot reproduce the error.

Thank you @Morganh

That would explain why I am getting good results. The only drawback then is not getting reports on how well the model is converging for each class.

I appreciate the lengths you have gone to in order to help.

Cheers

Hi @Morganh

So this issue with numpy concatenate appears to be related to whether the image files and the masks are of the same file type.

I've run some tests with the masks and image files being of the same type (for my experiments both were .png), rather than .jpeg images with .png masks (the masks have to be .png).

This seems to be at odds with the documentation:

Image and Mask Loading Format

SegFormer

For SegFormer, the path to images and mask folders can directly be provided in the dataset_config_segformer. Ensure that the image and the corresponding mask names are same. The image and mask extension don't have to be the same.
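
In case it is useful, this is the kind of quick check I ran before retrying (just a sketch; the folder paths below are placeholders for my own dataset layout):

    import os

    # Placeholder paths - point these at your own image and mask folders.
    img_dir = "/workspace/tao-experiments/data/images/train"
    ann_dir = "/workspace/tao-experiments/data/masks/train"

    img_stems = {os.path.splitext(f)[0] for f in os.listdir(img_dir)}
    mask_stems = {os.path.splitext(f)[0] for f in os.listdir(ann_dir)}

    # Every image should have a mask with the same basename, and vice versa.
    print("images without a mask:", sorted(img_stems - mask_stems))
    print("masks without an image:", sorted(mask_stems - img_stems))

    # And the extensions actually present in each folder.
    print("image extensions:", {os.path.splitext(f)[1] for f in os.listdir(img_dir)})
    print("mask extensions:", {os.path.splitext(f)[1] for f in os.listdir(ann_dir)})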

I’m happy to close this issue but please let me know if you want me to do more tests.

Thank you for your patience with this long running issue. I appreciate your help.

Cheers

Thanks for the information! Will try your case.

Hi @Morganh

Some further updates.

  1. I noticed that I had to run the 5.3 SegFormer with the container running under root privileges.
  2. The PyTorch implementation in 5.3 requires all images to be the same size during the validation hook runs. This was not the case with 5.2 and introduces data prep chores (see the quick size check sketched after this list).
  3. I used the exact same dataset for 5.2 and 5.3, however the 5.3 runs gave very poor (I would say random or numerically unstable) results, whereas 5.2 gave excellent results.
  4. I could not get any results (i.e. anything other than NaN) using 5.3 for FAN models. I was able to get "results" as per point 3 by using the mit_b5 backbone (but they were poor/meaningless).
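
For point 2, this is the sort of quick pre-check I mean (a sketch only; the folder path is a placeholder and it assumes Pillow is installed and the images are .png):

    from collections import Counter
    from pathlib import Path

    from PIL import Image

    # Placeholder path - point this at your validation image folder.
    val_dir = Path("/workspace/tao-experiments/data/images/val")

    # Count how many images exist at each resolution; more than one entry
    # means the 5.3 validation hook will see mixed sizes.
    sizes = Counter(Image.open(p).size for p in sorted(val_dir.glob("*.png")))
    for (width, height), count in sizes.most_common():
        print(f"{width}x{height}: {count} images")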

My datasets are custom, but I get great results from 5.2 (I've exported to TensorRT and am using it successfully in Triton in production).

Hope this provides some further insights.

cheers

So, items 2-4 seem to be regressions in 5.3. Could you please create a new topic, since they are different from this topic?
I will try to reproduce.

Hi @Morganh - just posted. Thanks.

Cheers
