High RAM usage with TLT UNet training

Hi, I am trying to train a resnet18-based UNet model with TLT on a machine with 2 GPUs. I have noticed that the RAM consumption in the first steps goes very high, causing the job to be killed. I was unable to use a per-GPU batch size of 3 with 640x640 images; the job was killed. I had to switch to 1-GPU mode, and with a dataset of 18000 images it used up 20 GB of RAM. I noticed that with datasets containing fewer images, the usable batch size at the same resolution is larger: with a smaller dataset (~2000 images) I was able to train with batch size 5 on 2 GPUs. However, batch size 6 with the same dataset on the same 32 GB RAM machine causes an out-of-memory kill. Could there be some memory leakage in the dataloader?

May I know which dataset you are training on? I need to check whether I can reproduce this.
BTW, which GPU did you use?

The larger one is Mapillary Vistas; the smaller one is Aeroscapes (GitHub - ishann/aeroscapes: Aerial Semantic Segmentation Benchmark).
In both cases the images are resized to 640x640.

We use GTX 1080 Ti GPUs.

Thanks for the info.
If possible, please share the full log with us.

Here is the log for training on Mapillary Vistas with --gpus 2. The job gets killed after exceeding 32 GB of RAM. The same command without --gpus 2 runs just fine.
log.txt (88.6 KB)

I've also just tried to run the same 1-GPU training but with batch size 7, and it also got killed after exceeding 32 GB.

If possible, could you please resize the images to a smaller size and retry?
BTW, for MaskRCNN in the 3.0_dp version (see https://forums.developer.nvidia.com/t/tlt-train-maskrcnn-model-with-mapillary-vistas-dataset-failed-on-cuda-error-out-of-memory-out-of-memory), it is recommended to resize the Mapillary Vistas dataset to 1/8 or smaller.
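For example, a minimal ImageMagick sketch (the 320x320 target size and the images/ and masks/ folder names are just placeholders for your layout). The label masks should be scaled without interpolation so the class IDs are preserved:

mkdir -p images_small masks_small
# RGB images: the default resize filter is fine
mogrify -path images_small -resize '320x320!' images/*.jpg
# label masks: point (nearest-neighbor) sampling so class IDs stay intact
mogrify -path masks_small -filter point -resize '320x320!' masks/*.png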

You see, it is of course possible to resize the images, but then there is the following problem: if I wanted to train on 3 GPUs or with a larger dataset, I would have to resize them even further. The point is that training with a batch of N images of a given size fits in RAM on 1 GPU but does not fit on 2. It seems that the whole dataset is being loaded into RAM at the beginning, and that this is done separately for every GPU. In that case downscaling would not be a solution, only a hotfix.
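To illustrate why I suspect this, here is a rough back-of-the-envelope estimate, assuming the loader keeps the decoded images in memory as uncompressed 8-bit RGB (masks not counted):

# 18000 images x 640 x 640 pixels x 3 bytes per pixel, in GiB
echo $(( 18000 * 640 * 640 * 3 / 1024 / 1024 / 1024 ))   # prints 20

That is about the 20 GB I observed with one GPU, and two such copies (one per GPU process) would not fit into 32 GB.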

Could you share the training spec? If possible, please share the 2000 resized images (Mapillary Vistas) along with their JSON file. I want to reproduce your error. Thanks a lot.

I am not sure what exactly you are asking for. I might not be allowed by Mapillary’s license to share their data, but I think I can share the other dataset of a couple of thousand images of the same resolution. You could also add some copies of them just to increase the size and see how that affects the problem. In any case there are no JSON files, as this is UNet semantic segmentation training. I can share the PNG annotations for the images.

Yes, if possible, please share the same resolution images with me.
Please share your training spec too. Thanks.

Ignore my request for the training spec. I get it from your previous log. Sorry for the inconvenience.

Please run with the latest docker image, nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3.
I ran with the 18000 Mapillary Vistas training images without any error on one GPU (GeForce GTX 1080 Ti).

BTW, UNet in the 3.0-py3 version does not require resizing the images/masks.

As I mentioned in the post with the log, that training spec was used for 2-GPU training. In 1-GPU mode that training runs fine, which you confirmed in your trial. If you could try rerunning the experiment with 2 GPUs, it would be great. Alternatively, you could increase the batch size from 5 (as in the spec) to 7, still using 1 GPU; that also led me to a RAM OOM. Did you monitor RAM usage? In the case of the 2000-image training it used 20 GB of RAM; that job did not get killed, but the usage is presumably much larger than expected unless the whole dataset is being loaded into RAM at once.
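For reference, this is roughly how I was watching the RAM usage while the job ran (sampling system memory every 5 seconds; the log file name is arbitrary):

# append total/used/free system memory (in MB) to a log every 5 seconds
free -m -s 5 >> ram_usage.log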

I think the docker image I used was the latest version as well.

Here I attach the log with the spec for training on 1 GPU with batch size 7. It gets killed.
log_b7.txt (45.0 KB)

I think you are using the 3.0-dp-py3 docker image. The latest 3.0-py3 docker image was released only three days ago. See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC.
Please try with it.
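If you installed the launcher via pip, upgrading the nvidia-tlt wrapper should make it use the new docker image; roughly something like this (assuming a pip installation):

# upgrade the TLT launcher so it pulls/uses the latest mapped docker image
pip3 install --upgrade nvidia-tlt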

OK, I will try and confirm.

For every resolution, there is a max batch size that can fit on 1 GPU.
If we want a greater batch size, we need to reduce the resolution a bit.

This is understandable, but I am pretty sure it is not GPU memory that is being exceeded but the system RAM.

BTW, I am trying to run with an updated nvidia-tlt and have hit another problem.

Is there any instruction on how to run training from inside the container? At the moment all the manuals go through the Python wrapper nvidia-tlt.
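What I have been trying is roughly the following; I am not sure whether the in-container entrypoint and flags are correct, they are my guess at what the launcher wraps (the mount paths and $KEY are placeholders):

# start the container directly, mounting the experiment directory
docker run --gpus all -it --rm -v /path/to/experiments:/workspace/experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash
# inside the container, presumably the same task entrypoint the launcher invokes:
unet train -e /workspace/experiments/spec.txt -r /workspace/experiments/results -k $KEY --gpus 2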