Efficiently Scaling Polars GPU Parquet Reader

Originally published at: Efficiently Scaling Polars GPU Parquet Reader | NVIDIA Technical Blog

When working with large datasets, the performance of your data processing tools becomes critical. Polars, an open-source library for data manipulation known for its speed and efficiency, offers a GPU-accelerated backend powered by cuDF that can significantly boost performance. However, to fully leverage the power of the Polars GPU backend, it’s essential to optimize the…

Hi,

Thank you for the blog—it was very interesting! I have a technical question about the y-axis in Figures 2 and 3: how is the throughput calculated? In Figure 3, the throughput suddenly reaches a maximum of ~100 GB/s, which surprises me.
So I wonder how this metric is calculated, especially since the blog doesn’t mention the GPU type. Without that context, it’s unclear how a single GPU could achieve 100 GB/s, as this exceeds typical single-GPU PCIe bandwidth limits.

Could you clarify these details? Thank you for your time!

Ping again.

Sorry for pinging again.

Hi @user157267, apologies for the late reply.
The throughput in Figures 2 and 3 is computed as:

throughput (GB/s) = scale factor (GB) / read time (s)

So for each point, we take the configured scale factor (i.e., the data size) and divide it by the measured time to read it.
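In code, that metric amounts to the following (a minimal sketch; the function name and the 300 GB / 3 s numbers are illustrative, not figures from the blog):

```python
def throughput_gbps(scale_factor_gb: float, read_time_s: float) -> float:
    """Throughput as plotted: the configured scale factor (data size
    in GB) divided by the measured wall-clock read time in seconds."""
    return scale_factor_gb / read_time_s

# Illustrative point: a 300 GB dataset read in 3 s plots as 100 GB/s.
print(throughput_gbps(300.0, 3.0))  # 100.0
```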

The hardware it was run on was an H100, which has PCIe Gen 5 connectivity with up to 128 GB/s of bandwidth.

Hi @premsagargali

Thanks for the reply, but I’m now more confused.
First, the H100’s 128 GB/s PCIe figure is bidirectional, so the maximum bandwidth in a single direction is around 64 GB/s.
Second, regarding the formula you sent: could the numerator be the uncompressed in-memory size of the dataset? It seems impossible to exceed 64 GB/s if it were based on the number of bytes read from the other side of PCIe, e.g., the Parquet file size on SSD.

Hi @user157267

Sorry for the confusion. The throughput is not calculated from the Parquet file sizes, but from the raw (decompressed) data size. Parquet encoding and compression let us copy less data over PCIe and get better performance than ingesting raw data from disk. So the scale factors are the size of the data once it has been decompressed and decoded into memory.
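A back-of-the-envelope sketch of why a decompressed-size metric can exceed the PCIe limit (the 64 GB/s and 3x figures here are assumptions for illustration, not measurements from the blog):

```python
# Bytes that actually cross PCIe are the encoded/compressed Parquet bytes;
# the plotted throughput counts decompressed bytes, so it scales with the ratio.
pcie_one_way_gbps = 64.0   # assumed usable one-direction PCIe Gen5 bandwidth
compression_ratio = 3.0    # assumed decompressed size / bytes moved over PCIe

effective_gbps = pcie_one_way_gbps * compression_ratio
print(effective_gbps)  # 192.0 -- above the raw PCIe limit
```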


Thank you, that’s clear now. I would call this “effective bandwidth”: it folds in the compression ratio and therefore has no fixed upper limit. For example, with heavy encoding the compression ratio could easily reach 1000×.
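As a quick illustration of how large the ratio can get on highly repetitive data (using zlib here as a stand-in for Parquet’s encodings; real ratios depend entirely on the data):

```python
import zlib

# A column of one million identical bytes: extreme run-length redundancy.
raw = b"0" * 1_000_000
compressed = zlib.compress(raw)

# For data like this the ratio lands in the hundreds; dedicated encodings
# such as Parquet's RLE can push it even higher.
ratio = len(raw) / len(compressed)
print(f"compression ratio: {ratio:.0f}x")
```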