Although the dataset contained only ~3.5M sessions and ~1.5M products, the number of possible session-product combinations runs into the trillions. We built a reranker over the top 100 candidate products per session, which yields 350M rows. For each session × product pair we added hundreds of features, which would require ~500 GB of memory if materialized at once.

First, we sliced the data in a smart way: our scripts can process each session independently of the others. We split the dataset by session into 100 small chunks per language and iterated over the chunks, using RAPIDS cuDF to accelerate the feature engineering and running multiple chunks in parallel (one chunk per GPU).

Training the reranker required Dask-cuDF, which distributes a dataset across multiple GPUs, and XGBoost has native Dask support. This let us train XGBoost on 8 GPUs, with 256 GB of total GPU memory.
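A minimal sketch of the per-chunk feature-engineering loop follows, assuming the session × product candidates have already been split into parquet chunks on disk. The file names, column names, and the example feature are hypothetical, not our exact pipeline:

```python
import cudf

N_CHUNKS = 100  # chunks per language, as described above

def add_features(pairs: cudf.DataFrame) -> cudf.DataFrame:
    # Hypothetical example feature: number of candidates per session.
    counts = (
        pairs.groupby("session")
        .agg({"product": "count"})
        .reset_index()
        .rename(columns={"product": "n_candidates"})
    )
    return pairs.merge(counts, on="session", how="left")

for i in range(N_CHUNKS):
    # Each chunk holds complete sessions, so chunks are independent and can
    # be farmed out to different GPUs (one process per GPU, e.g. selected
    # via CUDA_VISIBLE_DEVICES).
    chunk = cudf.read_parquet(f"candidates_chunk_{i}.parquet")
    add_features(chunk).to_parquet(f"features_chunk_{i}.parquet")
```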
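And a sketch of the multi-GPU training step with Dask-cuDF and XGBoost's Dask API. Paths, column names, the objective, and hyperparameters are assumptions for illustration, not our exact configuration:

```python
import xgboost as xgb
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One Dask worker per GPU; with 8 GPUs the 350M-row frame is
# partitioned across all of their memory.
cluster = LocalCUDACluster()
client = Client(cluster)

ddf = dask_cudf.read_parquet("features_chunk_*.parquet")
X = ddf.drop(columns=["session", "product", "label"])
y = ddf["label"]

# DaskQuantileDMatrix quantizes the data without materializing a full
# extra copy, which matters when the frame already fills most of GPU memory.
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)

params = {
    "objective": "binary:logistic",  # assumed target: candidate was the true next product or not
    "tree_method": "hist",
    "device": "cuda",  # XGBoost >= 2.0; older versions use tree_method="gpu_hist"
}
output = xgb.dask.train(client, params, dtrain, num_boost_round=500)
booster = output["booster"]
```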