Originally published at: https://developer.nvidia.com/blog/boosting-data-ingest-throughput-with-gpudirect-storage-and-rapids-cudf/
Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU.
Thank you for checking out our article. Since publication, we have added a “read throughput” analysis based on the benchmarks in this report. The analysis shows that for high-cardinality data with simple data types, we exceed 5 GiB/s end-to-end read throughput with GDS.
Hi,
I have a question about the additional figure labeled with “GDS” and “libcudf 22.04.”
Was the read throughput benchmark conducted directly on libcudf’s public API? If so, how can libcudf measure raw I/O performance the way a dedicated tool like GDS’s GDSIO does?
My understanding is that cuDF’s public API in libcudf is primarily for reading file formats such as CSV, and isn’t specifically designed to benchmark raw I/O performance. Internally, libcudf appears to use KvikIO (which in turn uses GDS) for device-level read/write operations. Could you confirm whether my understanding is correct?
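To illustrate the distinction I have in mind, here is a rough sketch using the Python bindings (the benchmark presumably went through the C++ API; the file names and buffer size below are placeholders I made up):

```python
# Rough sketch of the two layers, using the Python bindings.
# "data.parquet" and "data.bin" are placeholder file names.
import cudf
import cupy
import kvikio

# Format-level read through cuDF's public API:
# parse a Parquet file into a GPU table (columns, types, metadata).
df = cudf.read_parquet("data.parquet")

# Raw device read through KvikIO (which uses cuFile/GDS when available):
# no parsing, just bytes copied from the file into GPU memory.
buf = cupy.empty(128 * 2**20, dtype=cupy.uint8)  # 128 MiB device buffer
f = kvikio.CuFile("data.bin", "r")
nbytes_read = f.pread(buf).get()                 # blocks until the read completes
f.close()
```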
Thank you!
I have another question regarding Figure 2, specifically about the I/O pool and queue. It is very similar to my last one.
My understanding is that the figure and the concept of the read_async design fall under the scope of KvikIO. Can you confirm that the I/O thread pool is part of KvikIO rather than libcudf?
Additionally, I assume the benchmark in Figure 2 purely evaluates KvikIO (or GDSIO) and is not directly tied to cuDF/libcudf, which is what I asked in my first question. From the libcudf code I have seen, cuDF depends on KvikIO’s file_handle but doesn’t explicitly include an I/O thread pool of its own.
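To make my assumption concrete, this is how I picture the pool being exercised from the Python side; the file name, the buffer size, and the task_size keyword are assumptions on my part rather than anything taken from the blog:

```python
# How I picture KvikIO's internal I/O thread pool being used: a single pread
# is split into fixed-size tasks that the pool executes in parallel, and the
# returned future completes once every task has finished.
# The file name, buffer size, and the task_size keyword are my assumptions.
import cupy
import kvikio

buf = cupy.empty(1 * 2**30, dtype=cupy.uint8)   # 1 GiB device buffer
f = kvikio.CuFile("big_input.bin", "r")

future = f.pread(buf, task_size=4 * 2**20)      # split into ~4 MiB tasks
nbytes = future.get()                           # wait for the whole pool to finish

f.close()
```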
Thank you!
Thank you, Gregory Kimball, for liking both of my questions.
I also found the KvikIO Runtime Settings documentation. With these settings, it should be possible to reproduce Figure 2.
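For anyone else trying to reproduce it, here is the kind of starting point I have in mind; the values are my own guesses, not the settings used for the blog’s benchmark:

```python
# KvikIO runtime settings I would start from when trying to reproduce
# Figure 2. The values are my own guesses, not the ones from the benchmark.
# KvikIO picks these environment variables up when it initializes, so I set
# them before importing cudf.
import os

os.environ["KVIKIO_COMPAT_MODE"] = "OFF"          # force the cuFile/GDS path
os.environ["KVIKIO_NTHREADS"] = "8"               # size of the I/O thread pool
os.environ["KVIKIO_TASK_SIZE"] = str(4 * 2**20)   # 4 MiB per task

import cudf  # noqa: E402  (imported after setting the environment)

df = cudf.read_parquet("benchmark_file.parquet")  # placeholder file name
```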
Thanks very much to all the authors for this blog; it’s the only NVIDIA blog I’ve found that explains the design of KvikIO in detail. It’s very helpful!
Thank you user157267 for your thoughtful questions. The read throughput benchmark was collected using “read_parquet” in libcudf’s public API, with the throughput numbers representing end-to-end time from file read to cuDF table rather than raw I/O performance. I believe the “additional figure” throughput numbers were computed as file size divided by end-to-end processing time, for Parquet files with a low compression ratio.
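As a rough illustration of that calculation (a Python sketch only; the benchmark itself ran against the C++ API with its own harness, and the file name here is a placeholder):

```python
# Rough illustration of "file size over end-to-end processing time".
# "input.parquet" is just a placeholder file name.
import os
import time

import cudf

path = "input.parquet"
file_bytes = os.path.getsize(path)

start = time.perf_counter()
table = cudf.read_parquet(path)   # file read -> cuDF table, end to end
elapsed = time.perf_counter() - start

print(f"{file_bytes / elapsed / 2**30:.2f} GiB/s end-to-end")
```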
Thank you for asking about the I/O thread pool. Back in March 2022, I believe the benchmarks in Figure 2 used a threadpool in libcudf, and later in 2022 this threadpool was upstreamed to KvikIO. We are about to deprecate the libcudf threadpool in “Use KvikIO to enable file’s fast host read and host write” by kingcrimsontianyu (rapidsai/cudf#17764), and then we will fully rely on KvikIO for cuDF’s cuFile integration.
Please feel free to reach out to me and my team anytime on our public “rapids.ai” Slack workspace.
Thanks, I understand your point. I just reviewed the recent PR you mentioned, and I see how cuDF is transitioning from the kvikio-plus-cufile stage to a purely kvikio-based approach.
For future readers:
When this blog was written, kvikio was not fully available, so cuDF primarily used cufile with a thread pool for I/O performance. The same design principle carries over to kvikio, which has since been upstreamed from cudf as a standalone library. This transition is also the reason behind the questions I asked earlier.