Please use the command below to monitor the GPU memory.
$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1
And please also set the training batch size to 1 and retry.
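For reference, the batch size is controlled in the training spec file. A minimal sketch, assuming the usual efficientdet spec layout (the field names may differ slightly in your spec, so please check against your own file):

training_config {
  train_batch_size: 1   # reduce to 1 to lower GPU memory usage
}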
Could you try a smaller backbone, efficientdet-d0?
Could you please share the output of nvidia-smi?
Output of nvidia-smi
admin@r500-212c12:~$ nvidia-smi
Tue Jun 21 09:44:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
| 31%   45C    P8    17W / 350W |   3937MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
| 30%   31C    P8    17W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1388      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1593      G   /usr/bin/gnome-shell                8MiB |
|    0   N/A  N/A   1767448      C   tritonserver                     3913MiB |
|    1   N/A  N/A      1388      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
Please share the output of the command below as well.
$ nvidia-smi topo -m
Hello @Morganh,
Here is the output:
admin@r500-212c12:~$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-31            N/A
GPU1    SYS      X      0-31            N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
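As a side note on the matrix above: SYS between GPU0 and GPU1 means the only path between the two cards crosses the NUMA interconnect, which usually means CUDA peer-to-peer access between them is not available. If your nvidia-smi build supports the topo p2p query (an assumption; older driver builds may not have it), you can confirm this directly:

$ nvidia-smi topo -p2p r

A matrix reporting that p2p read is not supported between GPU0 and GPU1 would be consistent with the NCCL peer-access warning that comes up later in this thread.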
Hello @Morganh,
I updated the TAO Toolkit using the following command:
(tlt_env) admin@r500-212c12:~$ pip3 install --upgrade nvidia-tao
Now I have the following tao version:
(tlt_env) admin@r500-212c12:~$ tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
But I am getting the same error again while training.
Thanks. I will check if I can reproduce it.
Sorry, I will update you later.
No worries !!
Hi,
I cannot reproduce your issue. I triggered training with 2 x V100 GPUs without any problems.
From your log, there are many errors like the following:
r500-212c12:102:234 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'peer access is not supported between these two devices'
I suggest you update the NVIDIA driver to 510.
$ sudo apt install nvidia-driver-510
$ sudo reboot
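For example, you can confirm the new driver is active after the reboot with:
$ nvidia-smi --query-gpu=driver_version --format=csv
It should report a 510.x version for both GPUs.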
And also install the latest nvidia-tao.
$ pip3 install nvidia-tao==0.1.24
I ran my training with the 22.05 tf1.15.5 docker. You can use it as well.
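If you want to pull that image manually, the tag should look something like the following (the exact tag here is a guess from the version numbers above, so please verify it on NGC before pulling):
$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
Normally the tao launcher pulls the matching image automatically on the first run, so the manual pull is optional.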