I am training a two-tower model with the LazyAdam optimizer (based on this notebook). The processed dataset is around 100 GB, with millions of unique users and thousands of items.
Training starts fine, but GPU memory usage fluctuates during training and steadily increases until the GPU runs out of memory.
Are there any settings or configurations in Merlin that we can adjust to prevent this OOM issue?
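For reference, the training setup looks roughly like this (a simplified sketch; the paths, tower sizes, learning rate, and batch size are illustrative rather than my exact values, and I'm using the LazyAdam from tensorflow_addons):

```python
import tensorflow as tf
import tensorflow_addons as tfa

import merlin.models.tf as mm
from merlin.io import Dataset

# Processed parquet output from the NVTabular workflow (paths illustrative)
train = Dataset("/workspace/data/processed/train/*.parquet", part_size="256MB")
valid = Dataset("/workspace/data/processed/valid/*.parquet", part_size="256MB")

# Two-tower retrieval model built from the dataset schema, following the notebook
model = mm.TwoTowerModel(train.schema, query_tower=mm.MLPBlock([128, 64]))

model.compile(
    optimizer=tfa.optimizers.LazyAdam(learning_rate=1e-3),
    run_eagerly=False,
)

# GPU memory climbs steadily over the course of fit() until the OOM shown below
model.fit(train, validation_data=valid, batch_size=4096, epochs=3)
```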
GPU: NVIDIA GeForce RTX 2080 Ti with 11264 MB of memory
Container: merlin-tensorflow v22.10
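Judging from the log below, the allocator and memory settings in effect look like the container/notebook defaults; as far as I understand, they are roughly equivalent to the following (my best guess, not confirmed):

```python
import os

# Set before TensorFlow is imported so they take effect (values are my best guess):
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"  # matches "Using CUDA malloc Async allocator for GPU: 0"
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"            # fraction the Merlin dataloader lets TF reserve (~5632 MB of 11264 MB)

import tensorflow as tf  # imported only after the env vars above
```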
Log messages during training:
2022-11-23 08:14:08.063410: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.063792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.063927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.177133: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-23 08:14:08.177831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.178009: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.178140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.998165: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.998345: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.998482: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-23 08:14:08.998568: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2022-11-23 08:14:08.998660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5632 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2022-11-23 15:14:09 - Loading dataset
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
warnings.warn(
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(
2022-11-23 15:14:09 - Building the model
2022-11-23 15:14:09 - Training the model
Epoch 1/3
2022-11-23 15:15:45 - The sampler InBatchSampler returned no samples for this batch.
2022-11-23 16:58:16.068804: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:288] gpu_async_0 cuMemAllocAsync failed to allocate 268468224 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
Reported by CUDA: Free memory/Total memory: 268369920/11552227328
2022-11-23 16:58:16.090486: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:293] Stats: Limit: 5905580032
InUse: 6603338491
MaxInUse: 6883473861
NumAllocs: 28871387
MaxAllocSize: 1597946880
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-11-23 16:58:16.090528: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2022-11-23 16:58:16.090792: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1, 3
2022-11-23 16:58:16.090806: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 69
2022-11-23 16:58:16.090815: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 11
2022-11-23 16:58:16.090824: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16, 1
2022-11-23 16:58:16.090831: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 128, 3
2022-11-23 16:58:16.090844: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 204, 1
2022-11-23 16:58:16.090852: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 312, 1
2022-11-23 16:58:16.090859: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 512, 10
2022-11-23 16:58:16.090866: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 860, 1
2022-11-23 16:58:16.090873: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
2022-11-23 16:58:16.090879: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 2080, 3
2022-11-23 16:58:16.090887: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 2496, 3
2022-11-23 16:58:16.090895: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 13036, 1
2022-11-23 16:58:16.090903: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 20160, 3
2022-11-23 16:58:16.090909: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 30724, 1
2022-11-23 16:58:16.090919: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 32768, 40
2022-11-23 16:58:16.090927: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 40448, 5
2022-11-23 16:58:16.090933: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 57856, 5
2022-11-23 16:58:16.090940: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 312864, 2
2022-11-23 16:58:16.090947: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1941216, 3
2022-11-23 16:58:16.090954: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 2588672, 1
2022-11-23 16:58:16.090961: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 2949504, 2
2022-11-23 16:58:16.090970: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 3702784, 1
2022-11-23 16:58:16.090977: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4194304, 2
2022-11-23 16:58:16.090986: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 50331648, 1
2022-11-23 16:58:16.090994: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 51380224, 1
2022-11-23 16:58:16.091001: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 52428800, 1
2022-11-23 16:58:16.091008: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 67108864, 1
2022-11-23 16:58:16.091015: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 268468224, 4
2022-11-23 16:58:16.091022: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1597946880, 3
2022-11-23 16:58:16.091281: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:90] CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT: 6777995264
2022-11-23 16:58:16.091294: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:92] CU_MEMPOOL_ATTR_USED_MEM_CURRENT: 6117920019
2022-11-23 16:58:16.091310: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:93] CU_MEMPOOL_ATTR_RESERVED_MEM_HIGH: 7180648448
2022-11-23 16:58:16.091320: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:94] CU_MEMPOOL_ATTR_USED_MEM_HIGH: 6409232695
2022-11-23 16:58:16.194863: F tensorflow/stream_executor/cuda/cuda_driver.cc:152] Failed setting context: CUDA_ERROR_NOT_PERMITTED: operation not permitted
Memory used during training: