RAM limitation - Training process "Killed" before first iteration

Hello,

I have trained many networks in Clara before with <200 MB volumes. These volumes are ~350 voxels cubed, but I always trained by feeding only patches into the network due to VRAM limitations.

Now I want to train the network on much larger volumes (almost 10 GB each, around 2000 voxels cubed; of course, I will still be using patching). When I try to start this training, the first iteration never runs: system RAM maxes out over ~10 minutes and then the process crashes with a “Killed” message.

So my question is: is Clara trying to fit whole volumes into RAM? I am not sure what happens at this stage, while the network is preparing to train, that would cause the crash.

edit: Training starts fine with just one 10 GB scan in the train/validation sets, but with multiple 10 GB scans in each set it crashes before the first iteration. Why would Clara be trying to put my entire dataset into RAM before training?

Kyle

Hi
Thanks for your interest in the Clara Train SDK. Please note that we recently released Clara Train V4.0, which is based on MONAI and uses PyTorch. Please check out the notebooks to get started: clara-train-examples/PyTorch/NoteBooks at master · NVIDIA/clara-train-examples · GitHub

I am not sure whether you are using V4 or V3.1, but both releases have a concept of caching / smart cache. You need to lower the number of volumes you are caching. In V4 you would see a progress bar showing the number of volumes being cached.
In your case you would have to use smart cache to cache only a subset of your data, then use the replace parameter to swap out a percentage of the cached set at each epoch.
Please consult our documentation for the release you are using.
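For the V4 / MONAI release, a minimal sketch of what a smart-cache setup could look like is below. The file paths, cache sizes, and transform chain are placeholders for illustration, not your actual config:

```python
# Hypothetical MONAI (Clara Train V4) smart-cache sketch; the paths,
# cache_num, replace_rate and transform chain are placeholders.
from monai.data import SmartCacheDataset, DataLoader
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, RandCropByPosNegLabeld,
)

# Placeholder file list -- substitute your own datalist here.
train_files = [
    {"image": f"scan_{i}.nii.gz", "label": f"seg_{i}.nii.gz"} for i in range(8)
]

transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    # Random patch sampling happens on the fly; only the deterministic
    # transforms above are cached.
    RandCropByPosNegLabeld(
        keys=["image", "label"], label_key="label",
        spatial_size=(96, 96, 96), pos=1, neg=1, num_samples=4,
    ),
])

# cache_num bounds how many whole volumes sit in RAM at once;
# replace_rate controls what fraction of that cache is swapped per epoch.
train_ds = SmartCacheDataset(
    data=train_files,
    transform=transforms,
    cache_num=2,
    replace_rate=0.5,
)
train_loader = DataLoader(train_ds, batch_size=1, num_workers=2)

# When not driven by a workflow handler, the cache is managed manually:
# train_ds.start() before training, train_ds.update_cache() after each
# epoch, and train_ds.shutdown() at the end.
```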

Hope this helps

Thank you for the reply. I am experimenting with the caching, but not having any luck so far. My understanding is that caching will allow me to keep data in memory to reduce read times, but the issue I am having is too much data being stored in RAM.

As the number of samples in my dataset grows, more memory is used during each epoch. The behavior I want is for each iteration to hold in memory only the data relevant to that iteration.

For example:
I start training with one 3 GB scan: ~33 GB peak RAM before the first iteration
I start training with two 3 GB scans: ~63 GB peak RAM before the first iteration
I start training with four 3 GB scans: ~108 GB peak RAM before the first iteration

Ideally, peak RAM utilization would stay roughly constant regardless of the number of scans.
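For reference, the behavior I am after is roughly what a plain, non-caching loader would give: each worker reads and transforms one sample at a time, so RAM holds only the volumes of the current batch. A sketch, assuming the V4 / MONAI release, with placeholder paths and transforms:

```python
# Sketch of per-iteration loading with no caching (assuming V4 / MONAI);
# file names and crop size are placeholders.
from monai.data import Dataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, RandSpatialCropd

train_files = [
    {"image": f"scan_{i}.nii.gz", "label": f"seg_{i}.nii.gz"} for i in range(4)
]

transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    RandSpatialCropd(keys=["image", "label"], roi_size=(96, 96, 96), random_size=False),
])

# A plain Dataset applies the transform chain lazily per item, so peak RAM
# scales with batch size and worker count, not with the dataset size.
train_ds = Dataset(data=train_files, transform=transforms)
train_loader = DataLoader(train_ds, batch_size=1, num_workers=2)
```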

edit: I also just noticed that caching pipelines can’t be used during validation, so it won’t help my issue.

Hi
Could you clarify:

  • Which release are you using, V3.1 or V4?
  • Which dataloader are you using? Please attach your train config.

I think you are using full caching, which is why you run out of memory. If you see more and more memory being used each epoch, there might be some sort of memory leak in one of the transforms. You could remove all augmentations to narrow down the issue.
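One way to narrow it down (a hypothetical debugging sketch, not part of Clara itself) is to run the data pipeline on its own with a stripped-down transform chain and watch resident memory across epochs, then add augmentations back one at a time:

```python
# Hypothetical debugging sketch (MONAI / V4 assumed): run the data pipeline
# alone, with no augmentation, and watch resident memory across epochs.
# Paths and crop size are placeholders.
import os
import psutil
from monai.data import Dataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, RandSpatialCropd

files = [{"image": f"scan_{i}.nii.gz", "label": f"seg_{i}.nii.gz"} for i in range(4)]

minimal = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    RandSpatialCropd(keys=["image", "label"], roi_size=(96, 96, 96), random_size=False),
])

loader = DataLoader(Dataset(files, minimal), batch_size=1, num_workers=0)
proc = psutil.Process(os.getpid())

for epoch in range(3):
    for _batch in loader:
        pass
    # If memory grows epoch over epoch here, the loading/transform chain is
    # the culprit; if it stays flat, add augmentations back one at a time.
    print(f"epoch {epoch}: RSS = {proc.memory_info().rss / 1e9:.1f} GB")
```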

Hope that helps