GPUDirect Storage: A Direct Path Between Storage and GPU Memory

Keeping GPUs Busy

As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to strain overall application performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O. I/O, the process of loading data from storage to GPUs…

One application is the PostgreSQL database: SSD-to-GPU Direct SQL accelerates exactly these I/O-intensive workloads using the GPU!
The slides below are from PGconf.EU 2018. Please check them out.

This article does a good job of pointing out the bottleneck of PCIe connections between CPU and GPU, which causes GPU access to CPU memory to run at I/O speeds on servers using x86 CPU architectures. Thankfully, NVIDIA has partnered with IBM to create servers (the IBM AC922 with V100 GPUs) that overcome this bottleneck by placing NVLink 2.0 between CPUs and GPUs, thereby allowing CPU memory to be used by GPUs via NVLink. This lets GPU models/data grow to about 2 TB of memory on a single server, with GPUDirect RDMA over InfiniBand to scale to many servers.

It would be interesting to put that throughput on Figure 6. If I am interpreting that figure correctly as bidirectional bandwidth, the line for the NVIDIA/IBM server with 4 GPUs would be 150 GB/s (about three times as much as GPUDirect I/O). Did I interpret that figure correctly? For inclusion in Figure 5, I believe the latency for NVLink 2.0 access between a GPU and an IBM POWER9 CPU is somewhere under 10 microseconds. Is the Y axis in Figure 5 seconds, milliseconds, or microseconds?

For I/O, thankfully, the NVIDIA/IBM server also provides Gen4 PCIe to double that throughput, as well as a faster interconnect for multi-node workloads using GPUDirect RDMA over InfiniBand. And it's important not to forget the workload/data management provided by Spectrum LSF, which can preload data to where it needs to be before a job starts, allowing data to be kept on low-cost storage until needed, then moved to NVMe devices. Is there another article that covers the benefits of CPU-to-GPU NVLink 2.0 and PCIe Gen4 when combined with GPUDirect I/O?
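The 150 GB/s figure mentioned above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming (these numbers are not from the article) that each V100 in a 4-GPU AC922 connects to its POWER9 CPU over 3 NVLink 2.0 bricks at 25 GB/s per direction:

```python
# Back-of-envelope check of the per-GPU NVLink 2.0 CPU<->GPU bandwidth.
# Assumed topology (not stated in the article): 3 NVLink 2.0 bricks per
# GPU-CPU connection, each moving 25 GB/s in each direction.
bricks_per_gpu = 3
gb_per_s_per_brick_per_dir = 25

per_gpu_unidir = bricks_per_gpu * gb_per_s_per_brick_per_dir  # 75 GB/s one way
per_gpu_bidir = 2 * per_gpu_unidir                            # 150 GB/s both ways

print(per_gpu_unidir, per_gpu_bidir)  # 75 150
```

Under those assumptions, 150 GB/s matches a per-GPU bidirectional reading of the CPU-GPU link, consistent with interpreting the figure as bidirectional bandwidth.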

While I understand the direct DMA to and from GPU memory, it is not clear to me how this achieves the advantages currently provided by the file system cache. It is also not clear how a programmer avoids typical problems like having a copy of the same file data in each GPU's memory and keeping those copies in sync. What kind of debugging tools does NVIDIA provide? Maybe I think in an old-fashioned way, but one always has to assume programmers will make mistakes, and if you let hundreds of GPU cores do I/O operations, that can get really tough to debug.

This is very impressive technical work!

It is a bit unfortunate, however, that this blog does not cite or compare with prior art from my group, which not only shows how to enable peer-to-peer DMA between the GPU and NVMe (including RAID), but also integrates it with the CPU page cache, combines the two in the best way, and makes it all transparent to users via POSIX FS calls. The project, called SPIN, has been open source since 2017.

This is really impressive; thanks for this post and for including such a detailed level of performance measurement. I had the pleasure of talking with CJ at GTC this year. I find it interesting that mmap() plus faulting into GPU memory using UVA does not perform better than mmap() + cudaMemcpy(), but I guess it may have something to do with the depth of the PCIe tree? Is there a plan to distribute GPUDirect Storage as an additional API, or to add it to CUDA, or is this conceived as a DGX-only feature?
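For readers unfamiliar with the mmap() + cudaMemcpy() pattern being compared, here is a minimal host-side sketch in Python. The GPU transfer itself is represented by a hypothetical `copy_to_gpu` placeholder (standing in for a cudaMemcpy host-to-device call), since the point is only the staging structure: pages are first faulted into the CPU page cache via the mapping, then pushed to the device in one bulk copy.

```python
import mmap
import os
import tempfile

def copy_to_gpu(buf: bytes) -> int:
    # Placeholder for cudaMemcpy(dev_ptr, buf, len(buf), HostToDevice).
    # Here we only report how many bytes would cross the PCIe bus.
    return len(buf)

# Create a small file to stand in for the dataset.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        # mmap() + explicit copy: reading m[:] faults the pages into the
        # CPU page cache; the staged buffer is then copied to the GPU.
        nbytes = copy_to_gpu(m[:])
        print(nbytes)  # 4096
finally:
    os.close(fd)
    os.unlink(path)
```

The UVA fault-in variant being compared would instead touch the mapped pages from device code and let the driver migrate them on demand, trading the single bulk transfer for many small, fault-driven ones.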

I guess this may be the right place to post it: I've previously implemented an NVMe driver library for CUDA applications (one that can run both on the CPU and the GPU), and gave a talk about it at GTC 2019. That work was heavily inspired by @kaigaikohei's poster at GTC a couple of years back. The software is open source and works on a single machine, but it can also work across multiple PCIe root complexes using Dolphin's NTB technology.