I'm running into some issues while training a two-tower model with NVIDIA Merlin.
Instance: AWS g4dn.2xlarge with 32 GB of RAM and a T4 GPU with 16 GB of memory
Container: Merlin TensorFlow 22.11
Issues
When training the two-tower model, the script log shows that TensorFlow can only see 7680 MB of the T4's memory instead of the full 15360 MB.
During the post-processing stage of two-tower training, a CUDA OOM error occurred in the "Extract and save User features" step of 05-Retrieval-Model.ipynb from the merlin-models repository.
Here are the startup log and the error log from training the two-tower model via a Python script.
Both RAPIDS cuDF and TensorFlow reserve GPU memory pools to make dataframe buffer allocations performant. If you import RMM (the RAPIDS Memory Manager) before cuDF and TensorFlow, you can customize the memory pool sizes. TensorFlow also lets you limit how much of the GPU's memory it reserves.
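A minimal sketch of that setup, assuming the Merlin TensorFlow container (the pool and limit sizes below are illustrative values for a 16 GB T4, not recommendations):

```python
# Cap each library's GPU memory pool BEFORE any allocations happen.

# 1) Initialize RMM with an explicit pool before importing cudf,
#    so cuDF allocations are served from this pool.
import rmm

rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=4 * 1024**3,  # e.g. 4 GiB for cuDF/NVTabular
)
import cudf  # noqa: E402  (imported after RMM setup on purpose)

# 2) Give TensorFlow a fixed slice of the GPU instead of its default
#    reservation (which is why the log reports ~7680 MB of 15360 MB).
import tensorflow as tf  # noqa: E402

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=8192)],  # in MB
    )
```

Both calls must run before the first GPU allocation: `rmm.reinitialize` has no effect on buffers cuDF has already created, and TensorFlow's logical device configuration cannot be changed once the GPU has been initialized.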