TAO YOLOv4 training fails on multi-GPU instances with TensorBoard visualizer enabled

On an AWS g4dn.12xlarge instance, after adding visualizer { enabled: True } to my training spec file, tao yolo_v4 train fails when --gpus 4 is set, with the following error:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 145, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 141, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 126, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 77, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py", line 715, in train
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/fit_generator.py", line 222, in fit_generator
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 789, in _standardize_user_data
    exception_prefix='target')
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py", line 92, in standardize_input_data
    data = [standardize_single_array(x) for x in data]
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py", line 92, in <listcomp>
    data = [standardize_single_array(x) for x in data]
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py", line 27, in standardize_single_array
    elif x.ndim == 1:
AttributeError: 'tuple' object has no attribute 'ndim'

I’m using:

Container: v3.22.05-tf1.15.5-py3
toolkit_version: 3.22.05
published_date: 05/25/2022
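
For reference, this is roughly what I changed and how I'm launching training. The spec path, results directory, and key below are placeholders, and in my file the visualizer block sits inside training_config; the rest of the spec is the standard YOLOv4 example unchanged:

    training_config {
      ...
      visualizer {
        enabled: True
      }
    }

    tao yolo_v4 train -e /workspace/specs/yolo_v4_train.txt -r /workspace/results -k <key> --gpus 4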

Are there any workarounds?

Could you share the training spec file?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks
