Unable to Train Efficientdet on Multiple GPUS

Please use the command below to monitor GPU memory.
$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1
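If you want to log or post-process that output instead of watching it scroll, a small Python sketch can parse the CSV lines (the sample line below is illustrative, taken from the memory figures later in this thread; the `csv,noheader` variant of the flag is assumed so only data rows arrive):

```python
# Sketch: parse one line of
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader -i 0 -l 1
# The sample line is illustrative, not from a live query.

def parse_memory_csv(line: str) -> tuple[int, int]:
    """Return (used_mib, total_mib) from a 'used, total' CSV line."""
    used, total = (field.strip() for field in line.split(","))
    # Fields look like "3937 MiB"; keep only the leading number.
    return int(used.split()[0]), int(total.split()[0])

used, total = parse_memory_csv("3937 MiB, 24265 MiB")
print(f"{used} MiB used of {total} MiB ({100 * used / total:.1f}%)")
```

You could run this in a loop with `subprocess.check_output` around the same nvidia-smi command to build a memory trace during training.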

And please also set the training batch size to 1 and retry.

Hello @Morganh,
I tried with a batch size of 1 as well. It did not work.

Could you try a smaller backbone, efficientdet-d0?

Hello @Morganh,
Same error even with the efficientdet-d0 backbone.

Could you please share the output of nvidia-smi?

Output of nvidia-smi

admin@r500-212c12:~$ nvidia-smi
Tue Jun 21 09:44:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
| 31%   45C    P8    17W / 350W |   3937MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
| 30%   31C    P8    17W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1388      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1593      G   /usr/bin/gnome-shell                8MiB |
|    0   N/A  N/A   1767448      C   tritonserver                     3913MiB |
|    1   N/A  N/A      1388      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Please share the output of the command below as well.
$ nvidia-smi topo -m

Hello @Morganh,
Here is the output:

admin@r500-212c12:~$ nvidia-smi topo -m
             GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	         SYS	0-31		N/A
GPU1	SYS	            X 	0-31		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
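Note that the GPU0–GPU1 link here is SYS: traffic between the two cards crosses PCIe plus the inter-NUMA interconnect, with no NVLink bond. If you want to check this programmatically, a sketch like the following pulls the link type out of the matrix (the sample text mirrors the matrix above but uses plain spaces; real output is tab-separated, which `str.split()` also handles):

```python
# Sketch: extract the GPU0<->GPU1 link type from `nvidia-smi topo -m` output.
# SAMPLE mirrors the matrix in this thread; whitespace-separated fields assumed.

SAMPLE = """\
     GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0  X   SYS  0-31         N/A
GPU1  SYS  X   0-31         N/A
"""

def link_type(topo: str, a: str = "GPU0", b: str = "GPU1") -> str:
    """Return the interconnect label between GPUs a and b."""
    header = topo.splitlines()[0].split()
    col = header.index(b)            # column of GPU b in the matrix
    for row in topo.splitlines()[1:]:
        fields = row.split()
        if fields and fields[0] == a:
            return fields[1 + col]   # +1 skips the row label
    raise ValueError(f"{a} not found in topology matrix")

print(link_type(SAMPLE))  # SYS: PCIe across NUMA nodes, no NVLink
```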

Please use the latest nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 to check if it works.

Hi @Morganh,
Can you tell me how to update my tao-toolkit?

Hello @Morganh,
I updated the tao-toolkit using the following command:

(tlt_env) admin@r500-212c12:~$ pip3 install --upgrade nvidia-tao

Now I have the following TAO version:

(tlt_env) admin@r500-212c12:~$ tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

But I get the same error again while training.

Thanks. I will try to check if I can reproduce.

Hello @Morganh,
Were you able to reproduce the issue?

Sorry, I will update you later.

No worries!!

Hi,
I cannot reproduce your issue. I triggered training with 2 x V100 GPUs without any problem.
From your log, there are many errors like the following:

r500-212c12:102:234 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'peer access is not supported between these two devices'
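To see how often this failure shows up, you can scan the training log for NCCL warnings. A hedged sketch (the sample log text below is illustrative, built from the warning line above; in practice you would read your own log file):

```python
# Sketch: collect 'NCCL WARN' messages from a training log.
# The sample_log string is illustrative, not a real captured log.

import re

NCCL_WARN = re.compile(r"NCCL WARN (.+)")

def nccl_warnings(log_text: str) -> list[str]:
    """Return the message part of every 'NCCL WARN' line."""
    return [m.group(1) for m in NCCL_WARN.finditer(log_text)]

sample_log = (
    "r500-212c12:102:234 [0] transport/p2p.cc:136 NCCL WARN "
    "Cuda failure 'peer access is not supported between these two devices'\n"
    "epoch 1/10 loss=0.42\n"
)

for msg in nccl_warnings(sample_log):
    print(msg)
```

Repeated peer-access warnings like this point at the GPU-to-GPU path rather than at the model or batch size.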

I suggest you update the NVIDIA driver to 510.
$ sudo apt install nvidia-driver-510
$ sudo reboot

And also install the latest nvidia-tao.
$ pip3 install nvidia-tao==0.1.24

I ran my training with the 22.05 TF1.15.5 docker. You can use it as well.

Hello @Morganh,
Thanks for the information. I will update the system and get back to you!!