GPUDirect Storage: A Direct Path Between Storage and GPU Memory

jwitsoe · August 6, 2019, 1:01pm

Originally published at: https://developer.nvidia.com/blog/gpudirect-storage/

Keeping GPUs Busy

As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to place a strain on the total application’s performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O. I/O, the process of loading data from storage to GPUs…

anon4987499 · August 7, 2019, 2:08pm

Here is an application to the PostgreSQL database: SSD-to-GPU Direct SQL accelerates exactly these I/O-intensive workloads using the GPU!
The slides below were presented at PGConf.EU 2018. Please check them out.
https://www.slideshare.net/...

anon28734788 · August 14, 2019, 2:42pm

Hello,
This article does a good job of pointing out the PCIe bottleneck between CPU and GPU, which limits GPU access to CPU memory to I/O speeds on servers using x86 CPU architectures. Thankfully, NVIDIA has partnered with IBM to create servers (IBM AC922 with V100 GPUs) that overcome this bottleneck with NVLink 2.0 between CPUs and GPUs, allowing GPUs to use CPU memory over NVLink. This allows GPU models and data to grow to about 2TB of memory on a single server, and to scale to many servers using GPUDirect RDMA over InfiniBand.

It would be interesting to put that throughput on Figure 6. If I am interpreting that figure correctly as bidirectional bandwidth, the line for the NVIDIA/IBM server with 4 GPUs would be 150GB/s (about three times as much as GPUDirect IO). Did I interpret that figure correctly? For inclusion on Figure 5, I believe that the latency for NVLink 2.0 access between the GPU and the IBM Power9 CPU is somewhere under 10 microseconds. Is the Y axis in Figure 5 in seconds, milliseconds, or microseconds?

For I/O, thankfully, the NVIDIA/IBM server also provides PCIe Gen4 to double that throughput, as well as a faster interconnect for multi-node workloads using GPUDirect RDMA over InfiniBand. It's also important not to forget the workload/data management provided by Spectrum LSF, which can preload data to where it needs to be before a job starts, allowing data to be kept on low-cost storage until needed and then moved to NVMe devices. Is there another article that covers the benefits of CPU-to-GPU NVLink 2.0 and PCIe Gen4 when combined with GPUDirect I/O?

anon24583660 · August 15, 2019, 12:48am

While I understand the direct DMA to and from GPU memory, it is not clear to me how this can achieve the advantages currently provided by the file system cache. It is also not clear how a programmer avoids typical problems, such as having a copy of the same file data in each GPU's memory and keeping the copies in sync. What kind of debugging tools does NVIDIA provide? Maybe I think in an old-fashioned way, but one always has to assume programmers will make mistakes, and if you let hundreds of GPU cores do I/O operations, that can get really tough to debug.

anon19423687 · August 19, 2019, 6:50pm

This is very impressive technical work!

It is a bit unfortunate, however, that this blog does not cite or compare with prior art from my group, which not only shows how to enable peer-to-peer DMA between GPU and NVMe (including RAID), but also integrates it with the CPU page cache, combines the two in the best way, and makes it transparent to users via POSIX FS calls. The project, called SPIN, has been open source since 2017.

https://www.usenix.org/conf...
https://github.com/acsl-tec...

anon13936343 · September 3, 2019, 12:39pm

This is really impressive. Thanks for this post and for including such a detailed level of performance measurements. I had the pleasure of talking with CJ at GTC this year. I find it interesting that mmap() with faulting into GPU memory using UVA does not perform better than mmap() + cudaMemcpy(), but I guess it may have something to do with the depth of the PCIe tree? Is there a plan to distribute GPUDirect Storage as an additional API or to add it to CUDA, or is this conceived as a DGX-only feature?

I guess this may be the right place to mention that I've previously implemented an NVMe driver library for CUDA applications (it can run on both the CPU and the GPU), and gave a talk about it at GTC 2019. That work was heavily inspired by @kaigaikohei's poster at GTC a couple of years back. The software is open source and works on a single machine, but can also work across multiple PCIe root complexes by using Dolphin's NTB technology.

https://developer.nvidia.co...
https://github.com/enfiskut...

rakshit.patel · March 18, 2022, 7:49am

I have a GTX 1650 GPU. Does GDS work on that?

sajoshi · March 22, 2022, 4:29am

Bullet 1 here describes supported GPUs
