How did you handle the large data sizes and scale training effectively? Was GPU memory enough?
Although the dataset contained only ~3.5M sessions and ~1.5M products, the number of possible session x product combinations runs into the trillions. We built a reranker over the top-100 candidate products for each session, which resulted in 350M rows. For each session x product pair we added hundreds of features, which requires ~500 GB of memory.

First, we sliced the data in a smart way. Our scripts can process each session independently of the others, so we split the dataset by session into 100 small chunks per language and iterated over each chunk. We used RAPIDS cuDF to accelerate the feature engineering and ran multiple chunks in parallel, with each GPU processing one chunk. Training the reranker required dask_cudf, which distributes the dataset across multiple GPUs, and XGBoost has native Dask support. This let us train XGBoost on 8 GPUs with 256 GB of GPU memory.
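A minimal sketch of the chunk-wise feature engineering described above, assuming the candidates and product tables are stored as per-language Parquet files; the file names, column names, locale codes, and the add_features helper are illustrative placeholders, not the actual pipeline:

```python
# Sketch: process each language's sessions in 100 independent chunks with cuDF.
# Paths, column names, locales, and add_features() are illustrative placeholders.
import cudf

N_CHUNKS = 100
LANGUAGES = ["DE", "JP", "UK"]  # example locales; use whatever the dataset contains


def add_features(candidates: cudf.DataFrame, products: cudf.DataFrame) -> cudf.DataFrame:
    # Placeholder for the real feature engineering (co-visitation counts, similarities, ...).
    return candidates.merge(products, on="product_id", how="left")


for lang in LANGUAGES:
    # Each row is one session x top-100 candidate product pair.
    candidates = cudf.read_parquet(f"candidates_{lang}.parquet")
    products = cudf.read_parquet(f"products_{lang}.parquet")

    # Sessions are independent, so assign each one to a chunk (assumes integer session IDs).
    candidates["chunk"] = candidates["session_id"] % N_CHUNKS

    for chunk_id in range(N_CHUNKS):
        chunk = candidates[candidates["chunk"] == chunk_id]
        feats = add_features(chunk, products)
        feats.to_parquet(f"features_{lang}_{chunk_id}.parquet")
```

In practice several such loops can run in parallel, one per GPU, since each chunk is processed independently.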
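And a minimal sketch of the multi-GPU training step with dask_cudf and XGBoost's Dask API; the file names, the label column, and the hyperparameters are assumptions, and the real reranker's objective may differ:

```python
# Sketch: spread the reranker features over 8 GPUs with dask_cudf and train XGBoost via Dask.
# File names, the "label" column, and the hyperparameters are assumptions.
import glob

import dask_cudf
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # One Dask worker per GPU; 8 GPUs gives ~256 GB of pooled device memory.
    cluster = LocalCUDACluster(n_workers=8)
    client = Client(cluster)

    # dask_cudf partitions the chunked feature files across the GPU workers.
    ddf = dask_cudf.read_parquet(sorted(glob.glob("features_*_*.parquet")))
    X = ddf.drop(columns=["session_id", "product_id", "label"])
    y = ddf["label"]

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    params = {
        "objective": "binary:logistic",  # assumption: pointwise objective; a rank:* objective also works
        "tree_method": "gpu_hist",       # train on GPU
        "eval_metric": "logloss",
    }
    out = xgb.dask.train(client, params, dtrain, num_boost_round=1000)
    out["booster"].save_model("reranker_xgb.json")
```

With one Dask worker per GPU, XGBoost trains a single booster over all partitions, so the full 350M-row dataset never has to fit on a single device.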
Thanks for the answers.