Unable to Train EfficientDet on Multiple GPUs

• Hardware (T4/V100/Xavier/Nano/etc) - Xavier NX
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) - EfficientDet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) - 3.22.02

Hello everyone,
I am trying to train an EfficientDet model with TAO. Training works when I use a single GPU, but when I set the number of GPUs as follows, I get an error, which I have attached as the error log below.

%env NUM_GPUS=2

Error Log:
error_log.txt (270.9 KB)

Looking forward to your reply.

Have you tried any other network?

Hello @Morganh,
Yes. When using DetectNet_v2 with resnet34/resnet18/mobilenetv2 as the backbone, I was able to train on both GPUs in my system. It fails only with EfficientDet.

Please double-check by running it in a terminal.
Also, please use a new result folder and retry.

Hi @Morganh,
What do you mean by double-check?
Could you also explain what a new result folder means?

Double-check by running the training in a terminal instead of the Jupyter notebook.
A new result folder means specifying a new, previously unused result folder when you type the training command.

OK, thanks for the info. I currently have a training run in progress. I will get back to you in a few days with the output.

Hello @Morganh,
I am back with the output. In the terminal, after launching the Jupyter notebook and running the training cell, I can only see the following:

(tlt_env) admin@r500-212c12:~/cv_samples_1.3/efficientdet$ jupyter-lab
[I 2022-06-20 12:07:30.969 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-06-20 12:07:31.096 ServerApp] nbclassic | extension was successfully linked.
[I 2022-06-20 12:07:31.118 LabApp] JupyterLab extension loaded from /home/admin/.virtualenvs/tlt_env/lib/python3.8/site-packages/jupyterlab
[I 2022-06-20 12:07:31.118 LabApp] JupyterLab application directory is /home/admin/.virtualenvs/tlt_env/share/jupyter/lab
[I 2022-06-20 12:07:31.120 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-06-20 12:07:31.122 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-06-20 12:07:31.123 ServerApp] Serving notebooks from local directory: /home/admin/cv_samples_1.3/efficientdet
[I 2022-06-20 12:07:31.123 ServerApp] Jupyter Server 1.4.1 is running at:
[I 2022-06-20 12:07:31.123 ServerApp] http://localhost:8888/lab?token=865eb874be458c32f4d4476712f9281aecf861217713ac52
[I 2022-06-20 12:07:31.123 ServerApp]  or http://127.0.0.1:8888/lab?token=865eb874be458c32f4d4476712f9281aecf861217713ac52
[I 2022-06-20 12:07:31.123 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2022-06-20 12:07:31.191 ServerApp] 
    
    To access the server, open this file in a browser:
        file:///home/admin/.local/share/jupyter/runtime/jpserver-2894470-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/lab?token=865eb874be458c32f4d4476712f9281aecf861217713ac52
     or http://127.0.0.1:8888/lab?token=865eb874be458c32f4d4476712f9281aecf861217713ac52
[W 2022-06-20 12:07:34.956 LabApp] Could not determine jupyterlab build status without nodejs
(127.0.0.1) 1.68ms referer=http://localhost:8888/lab
/usr/lib/python3.8/json/encoder.py:257: UserWarning: date_default is deprecated since jupyter_client 7.0.0. Use jupyter_client.jsonutil.json_default.
  return _iterencode(o, 0)
[I 2022-06-20 12:07:39.190 ServerApp] Kernel started: 008da0cf-a63d-4d37-aac3-2c5bfb9cacc6
/usr/lib/python3.8/json/encoder.py:257: UserWarning: date_default is deprecated since jupyter_client 7.0.0. Use jupyter_client.jsonutil.json_default.
  return _iterencode(o, 0)
[IPKernelApp] ERROR | No such comm target registered: jupyter.widget.control
[IPKernelApp] WARNING | No such comm: 0175ed2b-59ad-4ce7-9ca8-a3eefb421cfc
[W 2022-06-20 12:07:42.565 ServerApp] Got events for closed stream <zmq.eventloop.zmqstream.ZMQStream object at 0x7f99b8221df0>
[I 2022-06-20 12:09:38.587 ServerApp] Saving file at /efficientdet-exp.ipynb
[I 2022-06-20 12:11:38.658 ServerApp] Saving file at /efficientdet-exp.ipynb
[I 2022-06-20 12:13:38.736 ServerApp] Saving file at /efficientdet-exp.ipynb

But I guess this is not what you wanted, right?
Can you tell me what I should be looking for in the terminal and how to do that?

Please run the training directly from the terminal, as shown below.
(tlt_env) admin@r500-212c12:~/cv_samples_1.3/efficientdet$ tao efficientdet train xxx

Hello @Morganh,
I tried the command you gave me. However, it throws the following error even though the experiment spec file is present at that location:

(tlt_env) admin@r500-212c12:~/tao-experiments/efficientdet$ tao efficientdet train --gpus 2 --use_amp -e /home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt -d /home/admin/tao-experiments/efficientdet/experiment_dir_unpruned -k nvidia_tlt
2022-06-20 14:52:53,743 [INFO] root: Registry: ['nvcr.io']
2022-06-20 14:52:53,857 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-gteoypbk because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Loading experiment spec at %s. /home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt
Using TensorFlow backend.
2022-06-20 14:53:00,737 [INFO] iva.efficientdet.utils.spec_loader: Merging specification from /home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt
Loading experiment spec at %s. /home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt
Using TensorFlow backend.
2022-06-20 14:53:00,738 [INFO] iva.efficientdet.utils.spec_loader: Merging specification from /home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/scripts/train.py", line 117, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/scripts/train.py", line 35, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 79, in load_experiment_spec
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 59, in load_proto
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 44, in _load_from_file
FileNotFoundError: [Errno 2] No such file or directory: '/home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt'
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/scripts/train.py", line 117, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/scripts/train.py", line 35, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 79, in load_experiment_spec
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 59, in load_proto
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/efficientdet/utils/spec_loader.py", line 44, in _load_from_file
FileNotFoundError: [Errno 2] No such file or directory: '/home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33076,1],1]
  Exit code:    1
--------------------------------------------------------------------------
2022-06-20 14:53:02,988 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
(tlt_env) admin@r500-212c12:~/tao-experiments/efficientdet$

Am I doing something wrong?

FileNotFoundError: [Errno 2] No such file or directory: ‘/home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt’

This kind of issue is usually caused by an incorrect ~/.tao_mounts.json.
Please make sure the JSON file correctly maps your local files into the Docker container.

Hello @Morganh,
Thank you for your reply! Is there a command I need to run to map the local files into the Docker container?
Until now I have only used the Jupyter notebook to train; I haven't run training directly from the terminal. Can you list the steps required before the train command?
Sorry for the trouble.

You just need to set up ~/.tao_mounts.json correctly.
Refer to TAO Toolkit Launcher — TAO Toolkit 3.22.05 documentation
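
For illustration only, here is a minimal Python sketch of what such a mapping could look like. It assumes the launcher reads a "Mounts" list of source/destination pairs from ~/.tao_mounts.json, and it assumes the host directories from the command posted earlier in this thread; adjust both to your setup. Mounting each directory at the same path inside the container is one possible choice that keeps the host paths in the command valid. The helper is_mounted is a hypothetical name used only for this example; it is not part of TAO.

```python
import json
from pathlib import Path

# Assumed host directories, taken from the train command earlier in this thread.
# Each one is mounted at the same path inside the container (an assumption made
# here so that the host paths passed to -e and -d also exist in the container).
host_dirs = [
    "/home/admin/cv_samples_1.3/efficientdet/specs",
    "/home/admin/tao-experiments",
]

mounts = {"Mounts": [{"source": d, "destination": d} for d in host_dirs]}


def is_mounted(path, mounts):
    """Hypothetical helper: True if `path` lies under some mount destination."""
    p = Path(path)
    return any(
        p == Path(m["destination"]) or Path(m["destination"]) in p.parents
        for m in mounts["Mounts"]
    )


spec = "/home/admin/cv_samples_1.3/efficientdet/specs/efficientdet_d1_train.txt"
print("spec visible in container:", is_mounted(spec, mounts))

# Print the JSON; after backing up any existing file, it could be saved
# as ~/.tao_mounts.json.
print(json.dumps(mounts, indent=4))
```

If the check reports False for a path you pass to the train command, that path is not covered by any mount, which would be consistent with the FileNotFoundError above.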

Hello @Morganh,
Thank you for assisting me. I made the necessary changes and that error went away, but now I get the same error I was getting in the Jupyter notebook when training with 2 GPUs. Training works on a single GPU. I am attaching the error log below:

Error_log_on_terminal.txt (277.2 KB)

Do you have an idea why this is happening?

To narrow this down: with 2 GPUs, does it work in the terminal instead of the Jupyter notebook?

Hello @Morganh,
No, it does not work with 2 GPUs in the terminal.

So, can you confirm that only EfficientDet fails to train with 2 GPUs, while other networks train fine?

Hello @Morganh,
Yes, other networks such as detectnet_v2 worked with 2 GPUs.

Please share your efficientdet_d1_train.txt. Thanks.

Hello @Morganh,
Here is the spec file:
efficientdet_d1_train.txt (1.2 KB)