While running YOLOv4 with CSPDarknet19, I get constant failures during the validation phase due to excessive host RAM usage. During a training epoch the experiment sits at roughly 14-15 GB of RAM. During validation, however, RAM usage spikes sporadically, at times maxing out all 64 GB of my onboard RAM. Sometimes this happens on the first epoch, sometimes on the 7th. In other cases the RAM climbs to a bearable level during validation but is then retained for the rest of the experiment; if that happens several validations in a row, it also maxes out the machine. This only affects onboard RAM, not the VRAM on my graphics card. I have only seen this with YOLOv4 + CSPDarknet19, but that is also the only experiment I have run since pulling the most recent TLT 3.0 container. Below is my experiment config:
Other applications account for the initial 7.7 GB of usage, but they did not change their footprint over the course of the experiment. The dGPU is an RTX 3090.
The architecture ran fine with smaller datasets. I am more curious whether I should always expect a large spike in RAM usage after the first validation period, and plan for it, or whether it is unique to that experiment. I am currently running YOLOv4 with a ResNet-18 backbone on the same dataset that caused the issue with the CSPDarknet19 run, and it has yet to exceed 35 GB of RAM, though it too spiked from 20 GB to roughly 30-34 GB after the first validation period and has not come back down since.
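For reference, this is roughly how I am tracking the host RAM numbers quoted above. It is a small hypothetical helper, not part of TLT itself; it reads `/proc/meminfo`, so it assumes a Linux host (which is the case inside the TLT container):

```python
import time

def meminfo_gib():
    """Parse /proc/meminfo (Linux) and return (total, available) in GiB."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are reported in kB
    kib_per_gib = 1024 ** 2
    return fields["MemTotal"] / kib_per_gib, fields["MemAvailable"] / kib_per_gib

def log_ram(tag=""):
    """Print a timestamped line with system-wide RAM usage."""
    total, available = meminfo_gib()
    used = total - available
    print(f"[{time.strftime('%H:%M:%S')}] {tag} "
          f"used {used:.1f} GiB of {total:.1f} GiB")

log_ram("after validation")
```

Polling this in a loop (or from a separate shell with `watch free -g`) is how I caught the spike landing on the validation phase rather than the training epochs.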