High RAM usage with TLT UNet training

Hi, I am trying to train a resnet18-based UNet model with TLT on a machine with 2 GPUs. I have noticed that the RAM consumption in the first steps goes very high, causing the job to be killed. I was unable to use a per-GPU batch size of 3 with 640x640 images; the job was killed. I had to switch to 1-GPU mode, and with a dataset of 18000 images it used up 20 GB of RAM. I noticed that with datasets containing fewer images, the usable batch size at the same resolution is larger: with a smaller dataset (~2000 images) I was able to train with batch size 5 on 2 GPUs. However, batch size 6 with the same dataset on the same 32 GB RAM machine causes an out-of-memory kill. Could there be some memory leakage in the dataloader?

May I know which dataset you are training on? I need to check whether I can reproduce this.
BTW, which GPU did you use?

The larger one is Mapillary Vistas; the smaller one is Aeroscapes (GitHub - ishann/aeroscapes: Aerial Semantic Segmentation Benchmark).
In both cases the images are resized to 640x640.

We use GTX 1080 Ti GPUs.

Thanks for the info.
If possible, please share the full log with us.

Here is the log for training on Mapillary Vistas with --gpus 2. The job gets killed after exceeding 32 GB of RAM. The same command without --gpus 2 runs just fine.
log.txt (88.6 KB)

I've also just tried to run the same 1-GPU training but with batch size 7, and it also got killed after exceeding 32 GB.

If possible, could you please resize the images to a smaller size and retry?
BTW, for MaskRCNN in the 3.0_dp version (see https://forums.developer.nvidia.com/t/tlt-train-maskrcnn-model-with-mapillary-vistas-dataset-failed-on-cuda-error-out-of-memory-out-of-memory), it is recommended to resize the Mapillary Vistas dataset to 1/8 or smaller.
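For example, a minimal ImageMagick sketch (the 320x320 target size and the images/ and masks/ folder names are just placeholders for your layout). The label masks should be scaled without interpolation so the class IDs are preserved:

mkdir -p images_small masks_small
# RGB images: the default resize filter is fine
mogrify -path images_small -resize '320x320!' images/*.jpg
# label masks: point (nearest-neighbor) sampling so class IDs stay intact
mogrify -path masks_small -filter point -resize '320x320!' masks/*.png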

You see, it is of course possible to resize the images, but then there is the following problem: if I wanted to train on 3 GPUs or with a larger dataset, I would have to resize them even further. The point is that training with a batch of N images of a given size fits in RAM on 1 GPU but does not fit on 2. It seems that the whole dataset is being loaded into RAM at the beginning, and that this is done separately for every GPU. In that case downscaling would not be a solution, only a hotfix.
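To illustrate why I suspect this, here is a rough back-of-the-envelope estimate, assuming the loader keeps the decoded images in memory as uncompressed 8-bit RGB (masks not counted):

# 18000 images x 640 x 640 pixels x 3 bytes per pixel, in GiB
echo $(( 18000 * 640 * 640 * 3 / 1024 / 1024 / 1024 ))   # prints 20

That is about the 20 GB I observed with one GPU, and two such copies (one per GPU process) would not fit into 32 GB.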

Could you share the training spec? If possible, please share the 2000 resized images (Mapillary Vistas) along with their JSON file. I want to reproduce your error. Thanks a lot.

I am not sure what exactly you are asking for. I might not be allowed by Mapillary’s license to share their data, but I think I can share the other dataset of a couple of thousand images of the same resolution. You could also add some copies of them just to increase the size and see how that affects the problem. In any case there are no JSON files, as this is UNet semantic segmentation training. I can share the PNG annotations for the images.

Yes, if possible, please share the same resolution images with me.
Please share your training spec too. Thanks.

Ignore my request for the training spec. I get it from your previous log. Sorry for the inconvenience.

Please run with the latest docker image, nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3.
I ran with the 18000 Mapillary Vistas training images without any error on one GPU (GeForce GTX 1080 Ti).

BTW, UNet in the 3.0-py3 version does not require resizing the images/masks.

As I mentioned in the post with the log, that training spec was used for 2-GPU training. In 1-GPU mode that training runs fine, which you confirmed in your trial. If you could try rerunning the experiment with 2 GPUs, it would be great. Alternatively, you could increase the batch size from 5 (as in the spec) to 7, still using 1 GPU; that also led me to a RAM OOM. Did you monitor RAM usage? In the case of the 2000-image training it used 20 GB of RAM; that job did not get killed, but the usage is presumably much larger than expected unless the whole dataset is being loaded into RAM at once.
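For reference, this is roughly how I was watching the RAM usage while the job ran (sampling system memory every 5 seconds; the log file name is arbitrary):

# append total/used/free system memory (in MB) to a log every 5 seconds
free -m -s 5 >> ram_usage.log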

I think the docker image I used was the latest version as well.

Here I attach the log with the spec for training on 1 GPU with batch size 7. It gets killed.
log_b7.txt (45.0 KB)

I think you are using the 3.0-dp-py3 docker image. The latest 3.0-py3 docker image was released only three days ago. See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC.
Please try with it.
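If you installed the launcher via pip, upgrading the nvidia-tlt wrapper should make it use the new docker image; roughly something like this (assuming a pip installation):

# upgrade the TLT launcher so it pulls/uses the latest mapped docker image
pip3 install --upgrade nvidia-tlt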

OK, I will try and confirm.

For every resolution, there is a max batch size that can fit on 1 GPU.
If we want a greater batch size, we need to reduce the resolution a bit.

This is understandable, but I am pretty sure it is not GPU memory that is being exceeded but the system RAM.

BTW, I am trying to run with an updated nvidia-tlt and have hit another problem.

Is there any instruction on how to run training from inside the container? At the moment all the manuals go through the Python wrapper nvidia-tlt.
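What I have been trying is roughly the following; I am not sure whether the in-container entrypoint and flags are correct, they are my guess at what the launcher wraps (the mount paths and $KEY are placeholders):

# start the container directly, mounting the experiment directory
docker run --gpus all -it --rm -v /path/to/experiments:/workspace/experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash
# inside the container, presumably the same task entrypoint the launcher invokes:
unet train -e /workspace/experiments/spec.txt -r /workspace/experiments/results -k $KEY --gpus 2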