Issues with TensorFlow allocated memory and CUDA OOM during post-processing of two-tower training on NVIDIA Merlin

I have some issues while training a two-tower model on NVIDIA Merlin.

Instance: AWS g4dn.2xlarge with 32 GB of RAM and 16 GB of T4 GPU memory
Container: Merlin TensorFlow 22.11

Issues

  1. When I train a two-tower model, the script log shows that TensorFlow can only see 7680 MB of T4 memory instead of the full 15360 MB.
  2. During the post-processing stage of two-tower training, a CUDA OOM error occurred in the "Extract and save User features" step of 05-Retrieval-Model.ipynb in the merlin-models repository.

Here are the startup log and error log from training a two-tower model via a Python script.

oom_error.txt (18.9 KB)

Hello Pannavatter, to start with, it would be helpful to get more information on your current system.

Hence, could you share:

  1. the output of the nvidia-smi command
  2. the size of your data/model

Both RAPIDS cuDF and TensorFlow reserve GPU memory pools to make dataframe buffer allocations performant. If you import rmm (the RAPIDS Memory Manager) before cuDF and TensorFlow, you can customize the memory pool sizes. TensorFlow also has the ability to specify the fraction of GPU memory it reserves.

Example of setting the memory pool size:

import rmm

# Configure RMM before importing cudf or tensorflow so the pool is already
# in place when they initialize on the GPU.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2**30,   # 1 GiB
    maximum_pool_size=2**32,   # 4 GiB
)
rmm.mr.set_current_device_resource(pool)

Link to the RMM programming guide: https://docs.rapids.ai/api/rmm/stable/basics.html
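
On the TensorFlow side, here is a minimal sketch (not from this thread) of limiting how much GPU memory TensorFlow grabs, using the standard tf.config API; the 8192 MB cap is only an illustrative value. If I remember correctly, the Merlin/NVTabular dataloader also honors a TF_MEMORY_ALLOCATION environment variable set to a fraction of GPU memory, and a default around 0.5 would match the 7680 MB (half of 15360 MB) you see in the log, but please verify that against your container version.

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Option A: let TensorFlow allocate GPU memory on demand instead of
    # reserving a large pool up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option B (use instead of A, not together): hard-cap the memory
    # TensorFlow may reserve, leaving the rest for cuDF/RMM.
    # The 8192 MB limit is only an example value.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=8192)],
    # )

Note that both calls must run before any operation initializes the GPU.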