While running YOLOv4 with CSPDarknet19, I get constant failures during the validation phase due to excessive host RAM usage. During a training epoch the experiment sits at roughly 14-15 GB of RAM. During validation, however, RAM usage spikes sporadically, at times maxing out all 64 GB of my onboard RAM. Sometimes this happens on the first epoch, sometimes on the 7th. In other cases the RAM climbs to a bearable level during validation but is then retained for the rest of the experiment; if that happens several validations in a row, it also maxes out the machine. This only affects onboard RAM, not the VRAM on my graphics card. I have only seen this with YOLOv4 + CSPDarknet19, but that is also the only experiment I have run since pulling the most recent TLT 3.0 container. Below is my experiment config:
Other applications account for the initial 7.7 GB of usage, but they did not change their footprint over the course of the experiment. The dGPU is an RTX 3090.
The architecture ran fine with smaller datasets. I am more curious whether I should always expect a large spike in RAM usage after the first validation period, and plan for it, or whether it is unique to that experiment. I am currently running YOLOv4 with a ResNet-18 backbone on the same dataset that caused the issue with the CSPDarknet19 run, and it has yet to exceed 35 GB of RAM, though it too spiked from 20 GB to roughly 30-34 GB after the first validation period and has not come back down since.
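For reference, this is roughly how I am tracking the host RAM numbers quoted above. It is a small hypothetical helper, not part of TLT itself; it reads `/proc/meminfo`, so it assumes a Linux host (which is the case inside the TLT container):

```python
import time

def meminfo_gib():
    """Parse /proc/meminfo (Linux) and return (total, available) in GiB."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are reported in kB
    kib_per_gib = 1024 ** 2
    return fields["MemTotal"] / kib_per_gib, fields["MemAvailable"] / kib_per_gib

def log_ram(tag=""):
    """Print a timestamped line with system-wide RAM usage."""
    total, available = meminfo_gib()
    used = total - available
    print(f"[{time.strftime('%H:%M:%S')}] {tag} "
          f"used {used:.1f} GiB of {total:.1f} GiB")

log_ram("after validation")
```

Polling this in a loop (or from a separate shell with `watch free -g`) is how I caught the spike landing on the validation phase rather than the training epochs.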