Accelerating Apache Parquet Scans on Apache Spark with GPUs

Originally published at: https://developer.nvidia.com/blog/accelerating-apache-parquet-scans-on-apache-spark-with-gpus/

As data sizes have grown in enterprises across industries, Apache Parquet has become a prominent format for storing data. Parquet is a columnar storage format designed for efficient data processing at scale: by organizing data by columns rather than rows, it enables high-performance querying and analysis, since a query can read only the columns it needs…

This effort delivered significant improvements across multiple customer workloads on GPUs, as Parquet scans are a key component of large-scale Apache Spark workloads. If you have any questions or comments about this optimization or about using GPUs for Apache Spark jobs, let us know.