GTC 2020 S21482
Presenters: Nikolay Sakharnykh,NVIDIA; Hao Gao,University of Illinois Urbana-Champaign
There are numerous optimized single-GPU join implementations (such as RAPIDS cuDF), but scaling out to multiple GPUs across multiple nodes is challenging. The repartitioned join approach is one of the most popular distributed join algorithms, featuring all-to-all exchange as the main communication pattern. We’ll show how to leverage UCX for efficient all-to-all implementation and demonstrate various optimization strategies, such as reusing communication buffers to speed up GPU-to-GPU transfers and overlapping compute with communications. The implementation is designed to reuse RAPIDS components for single-GPU, and scales to NVLINK and Infiniband systems. Our latest performance results demonstrate that a single DGX-2 can achieve 220 GB/s throughput for joining 8B/8B key-value pairs, while 18 DGX-1V nodes (144 GPUs) connected over IB achieve 503 GB/s, which is comparable with 244 CPU nodes (2K cores) in the best-known distributed CPU implementation.
Watch this session
Join in the conversation below.