Originally published at: https://developer.nvidia.com/blog/boosting-data-ingest-throughput-with-gpudirect-storage-and-rapids-cudf/
Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU.
Thank you for checking out our article. Since publication, we have added a “read throughput” analysis based on the benchmarks in this report. The analysis shows that for high-cardinality data with simple data types, we exceed 5 GiB/s end-to-end read throughput with GDS.
Hi,
I have a question about the additional figure labeled with “GDS” and “libcudf 22.04.”
Was the read throughput benchmark conducted directly on libcudf’s public API? If so, how can libcudf measure raw I/O performance the way a dedicated tool like GDS’s GDSIO does?
My understanding is that cuDF’s public API in libcudf is primarily for reading file formats such as CSV, and isn’t specifically designed to benchmark raw I/O performance. Internally, libcudf appears to use KvikIO (which in turn uses GDS) for device-level read/write operations. Could you confirm whether my understanding is correct?
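To illustrate the distinction I have in mind, here is a rough sketch using the Python bindings (the benchmark presumably went through the C++ API; the file names and buffer size below are placeholders I made up):

```python
# Rough sketch of the two layers, using the Python bindings.
# "data.parquet" and "data.bin" are placeholder file names.
import cudf
import cupy
import kvikio

# Format-level read through cuDF's public API:
# parse a Parquet file into a GPU table (columns, types, metadata).
df = cudf.read_parquet("data.parquet")

# Raw device read through KvikIO (which uses cuFile/GDS when available):
# no parsing, just bytes copied from the file into GPU memory.
buf = cupy.empty(128 * 2**20, dtype=cupy.uint8)  # 128 MiB device buffer
f = kvikio.CuFile("data.bin", "r")
nbytes_read = f.pread(buf).get()                 # blocks until the read completes
f.close()
```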
Thank you!
I have another question regarding Figure 2, specifically about the I/O pool and queue. It is very similar to my last one.
My understanding is that the figure and the concept of the read_async design fall under the scope of KvikIO. Can you confirm that the I/O thread pool is part of KvikIO rather than libcudf?
Additionally, I assume the benchmark in Figure 2 purely evaluates KvikIO (or GDSIO) and is not directly tied to cuDF/libcudf, which is what I asked in my first question. From the libcudf code I have seen, cuDF depends on KvikIO’s file_handle but doesn’t explicitly include an I/O thread pool of its own.
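To make my assumption concrete, this is how I picture the pool being exercised from the Python side; the file name, the buffer size, and the task_size keyword are assumptions on my part rather than anything taken from the blog:

```python
# How I picture KvikIO's internal I/O thread pool being used: a single pread
# is split into fixed-size tasks that the pool executes in parallel, and the
# returned future completes once every task has finished.
# The file name, buffer size, and the task_size keyword are my assumptions.
import cupy
import kvikio

buf = cupy.empty(1 * 2**30, dtype=cupy.uint8)   # 1 GiB device buffer
f = kvikio.CuFile("big_input.bin", "r")

future = f.pread(buf, task_size=4 * 2**20)      # split into ~4 MiB tasks
nbytes = future.get()                           # wait for the whole pool to finish

f.close()
```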
Thank you!
Thank you, Gregory Kimball, for liking both of my questions.
I also found the KvikIO Runtime Settings documentation. With these settings, it should be possible to reproduce Figure 2.
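For anyone else trying to reproduce it, here is the kind of starting point I have in mind; the values are my own guesses, not the settings used for the blog’s benchmark:

```python
# KvikIO runtime settings I would start from when trying to reproduce
# Figure 2. The values are my own guesses, not the ones from the benchmark.
# KvikIO picks these environment variables up when it initializes, so I set
# them before importing cudf.
import os

os.environ["KVIKIO_COMPAT_MODE"] = "OFF"          # force the cuFile/GDS path
os.environ["KVIKIO_NTHREADS"] = "8"               # size of the I/O thread pool
os.environ["KVIKIO_TASK_SIZE"] = str(4 * 2**20)   # 4 MiB per task

import cudf  # noqa: E402  (imported after setting the environment)

df = cudf.read_parquet("benchmark_file.parquet")  # placeholder file name
```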
Thanks very much to all the authors for this blog; it’s the only NVIDIA blog I’ve found that explains the design of KvikIO in detail. It’s very helpful!
Thank you user157267 for your thoughtful questions. The read throughput benchmark was collected using “read_parquet” in libcudf’s public API, with the throughput numbers representing end-to-end time from file read to cuDF table rather than raw I/O performance. I believe the “additional figure” throughput numbers were computed as file size divided by end-to-end processing time, for Parquet files with a low compression ratio.
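As a rough illustration of that calculation (a Python sketch only; the benchmark itself ran against the C++ API with its own harness, and the file name here is a placeholder):

```python
# Rough illustration of "file size over end-to-end processing time".
# "input.parquet" is just a placeholder file name.
import os
import time

import cudf

path = "input.parquet"
file_bytes = os.path.getsize(path)

start = time.perf_counter()
table = cudf.read_parquet(path)   # file read -> cuDF table, end to end
elapsed = time.perf_counter() - start

print(f"{file_bytes / elapsed / 2**30:.2f} GiB/s end-to-end")
```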
Thank you for asking about the I/O thread pool. Back in March 2022, I believe the benchmarks in Figure 2 used a threadpool in libcudf, and later in 2022 this threadpool was upstreamed to KvikIO. We are about to deprecate the libcudf threadpool in “Use KvikIO to enable file’s fast host read and host write” by kingcrimsontianyu (rapidsai/cudf#17764), and then we will fully rely on KvikIO for cuDF’s cuFile integration.
Please feel free to reach out to me and my team anytime on our public “rapids.ai” Slack workspace.
Thanks, I understand your point. I just reviewed the recent PR you mentioned, and I see how cuDF is transitioning from the kvikio-plus-cufile stage to a purely kvikio-based approach.
For future readers:
When this blog was written, kvikio was not fully available, so cuDF primarily used cufile with a thread pool for I/O performance. The same design principle carries over to kvikio, which has since been upstreamed from cudf as a standalone library. This transition is also the reason behind the questions I asked earlier.