Making Apache Spark More Concurrent

Originally published at:

Apache Spark provides capabilities to program entire clusters with implicit data parallelism. With Spark 3.0 and the open source RAPIDS Accelerator for Spark, these capabilities are extended to GPUs. However, prior to this work, all CUDA operations were issued on the default stream, causing implicit synchronization and leaving the GPU's concurrency unexploited. In…
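As a rough illustration of the idea (not code from the project itself), CUDA's built-in per-thread default stream mode gives each host thread its own default stream, so kernels launched from different Spark task threads can overlap instead of serializing on the one legacy default stream. The kernel, thread count, and file name below are hypothetical:

```cuda
// Compile with per-thread default streams enabled:
//   nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// Under this flag, "stream 0" is a distinct stream per host thread,
// so the two kernels launched below can run concurrently on the GPU.
#include <cstdio>
#include <thread>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k) v = v * 1.0000001f + 0.5f;
        data[i] = v;
    }
}

void worker(int id) {
    const int n = 1 << 20;
    float *d = nullptr;
    // Note: cudaMalloc/cudaFree synchronize the whole device, which is
    // exactly why a stream-aware memory pool matters in practice.
    cudaMalloc(&d, n * sizeof(float));
    // Launched on this thread's own default stream.
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaStreamSynchronize(0);  // waits only on this thread's stream
    cudaFree(d);
    printf("thread %d done\n", id);
}

int main() {
    std::thread t1(worker, 1), t2(worker, 2);
    t1.join();
    t2.join();
    return 0;
}
```

Without `--default-stream per-thread`, both launches land on the single legacy default stream and execute one after the other.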

This was an interesting project, going up and down the stack to get everything working in a multi-threaded environment on the host side. It was particularly challenging to make the memory pool work efficiently with per-thread default streams without causing too much fragmentation. We drew inspiration from jemalloc, Hoard, tcmalloc, and the classic 1995 allocator survey by Paul R. Wilson.
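To give a flavor of the per-thread-arena idea those allocators popularized, here is a deliberately naive host-side sketch (not the project's actual allocator): one big up-front device allocation is carved into per-thread arenas, so each thread can allocate without locking or synchronizing other threads' streams. All names are illustrative, and the bump-pointer scheme shows why fragmentation and block reuse become the hard part:

```cuda
// Illustrative-only sketch of per-thread arenas over a GPU pool.
// A real allocator (in the spirit of jemalloc/tcmalloc) adds size
// classes, free-block reuse, and arena rebalancing to keep the
// fragmentation of this naive bump pointer under control.
#include <atomic>
#include <cstddef>

struct Arena {
    char  *base;    // start of this thread's slice of device memory
    size_t size;    // slice size in bytes
    size_t offset;  // bump pointer; touched only by the owning thread

    void *allocate(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);  // 256-byte align
        if (offset + aligned > size) return nullptr;    // arena exhausted
        void *p = base + offset;
        offset += aligned;
        return p;  // stream-ordered use on the owner's default stream
    }
};

class PerThreadPool {
    char *pool_ = nullptr;
    size_t arenaSize_;
    std::atomic<size_t> nextArena_{0};

public:
    PerThreadPool(size_t totalBytes, size_t arenaSize)
        : arenaSize_(arenaSize) {
        cudaMalloc(&pool_, totalBytes);  // one device-synchronizing
                                         // allocation, paid once up front
    }
    ~PerThreadPool() { cudaFree(pool_); }

    // Each host thread calls this once to claim a private slice;
    // the atomic counter is the only cross-thread coordination.
    Arena makeArena() {
        size_t idx = nextArena_.fetch_add(1);
        return Arena{pool_ + idx * arenaSize_, arenaSize_, 0};
    }
};
```

Because each arena is owned by exactly one thread, allocations never force cross-thread synchronization; the trade-off is that memory stranded in one thread's arena cannot serve another thread's request, which is the fragmentation tension the post alludes to.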

If you have any questions or comments, let us know.