YOLOv4 with CSPDarknet-19: validation taking up excessive RAM

While running YOLOv4 with a CSPDarknet-19 backbone, training constantly fails because of the aggressive amount of RAM consumed during the validation phase. During a training epoch the experiment sits at about 14-15 GB of RAM, but during validation the usage spikes sporadically, at some points maxing out all 64 GB of my onboard RAM. Sometimes this happens on the first epoch, sometimes on the 7th. In other cases RAM usage climbs to a tolerable level during validation but is never released for the rest of the experiment; if that happens several times in a row, it also maxes out the machine.

This affects only my onboard RAM, not the VRAM on my graphics card. I have only noticed it on YOLOv4 with CSPDarknet-19; however, that is also the only experiment I have run since pulling the most recent TLT 3.0 container. Below is my experiment config:

experiment_spec.txt (4.3 KB)

To narrow this down, could you run the yolo_v4 notebook with the default KITTI dataset? See the NVIDIA TAO Documentation.

Here are my RAM usage results from the KITTI run using the notebook:
prior to start: 7.7 GB
epoch 1 train: 19.1 GB
epoch 1 val: 26.8 GB
epoch 2 train: 25.5 GB
epoch 2 val: 30.0 GB
epoch 3 train: 26.6 GB
epoch 3 val: 29.6 GB

It remained at roughly these levels for the following 10 epochs.
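
For anyone trying to reproduce these measurements, a minimal watcher along these lines (a sketch assuming the psutil package is installed; the sampling interval and log path are arbitrary choices) can log system RAM with timestamps so the spikes can be matched to the train/val phases:

  # Minimal sketch: sample overall system RAM every few seconds and log it
  # with a timestamp so spikes can be matched to train/val phases.
  import time

  import psutil  # assumed installed: pip install psutil

  INTERVAL_S = 5  # sampling period; arbitrary choice

  with open("ram_usage.log", "a") as log:
      while True:
          mem = psutil.virtual_memory()
          used_gb = (mem.total - mem.available) / 1024 ** 3
          log.write(f"{time.strftime('%H:%M:%S')} used={used_gb:.1f} GiB\n")
          log.flush()
          time.sleep(INTERVAL_S)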

Which dGPU did you run on?
Also, please check whether there are other applications consuming memory.
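
For example, a short psutil sketch (illustrative only; the top-10 cutoff is an arbitrary choice) can list the processes holding the most resident memory:

  # List the ten processes holding the most resident memory (RSS).
  import psutil  # assumed installed: pip install psutil

  rows = []
  for proc in psutil.process_iter():
      try:
          rows.append((proc.memory_info().rss, proc.pid, proc.name()))
      except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
          continue  # process exited or is inaccessible; skip it

  for rss, pid, name in sorted(rows, reverse=True)[:10]:
      print(f"{rss / 1024 ** 3:6.2f} GiB  pid={pid:<7} {name}")

Running it once before training and again during validation should show whether the growth comes from the training process itself or from something else on the machine.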

There are other applications taking up the initial 7.7 GB; however, their usage did not change over the course of the experiment. The dGPU is an RTX 3090.

Can you set a lower output_width/output_height and retry? For example, in the augmentation_config section of the spec:

  augmentation_config {
    output_width: 960
    output_height: 544
  }

Also, did you hit this issue when running any other dataset, for example the public KITTI dataset?

The architecture ran fine with smaller datasets. I am more curious whether I will always see a large spike in RAM usage after the first validation pass, and whether that is something I should plan for or something unique to this experiment. I am currently running YOLOv4 with a ResNet-18 backbone on the same dataset that caused the issue with CSPDarknet-19, and I have yet to exceed 35 GB of RAM usage, though this run also spiked from 20 GB to roughly 30-34 GB after the first validation pass and has not released that memory since.

How many training and validation images in your dataset?

Roughly 40,000 training / 10,000 validation.

Please check whether the above input size helps.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.