Beyond GPU Memory Limits with Unified Memory on Pascal

jwitsoe · December 13, 2016, 10:11am

Originally published at: https://developer.nvidia.com/blog/beyond-gpu-memory-limits-unified-memory-pascal/

Figure 1: Dimethyl ether jet simulations designed to study complex new fuels. Image courtesy of the Center for Exascale Simulation of Combustion in Turbulence (ExaCT).

Modern computer architectures have a hierarchy of memories of varying size and performance. GPU architectures are approaching a terabyte per second memory bandwidth that, coupled with high-throughput computational cores, creates…

anon96901432 · December 15, 2016, 10:52am

Can I take advantage of the Page Migration Engine in Pascal GPUs when using OpenCL? I think it would drastically simplify some parts of the code generator for my compiler, but it generates only OpenCL code.

anon15011306 · December 19, 2016, 4:03pm

Hi Troels. Currently the Page Migration Engine functionality is not supported in our OpenCL implementation.

anon96901432 · December 19, 2016, 4:56pm

That's a shame. Is there any hope of support being added in the future?

anon15011306 · December 21, 2016, 8:16pm

Sorry, at this time we do not have plans to add support for it.

anon6244879 · December 22, 2016, 6:49pm

Hi Nikolay. The Unified Memory techniques you applied to the simulation are amazing; thanks for the introduction. I have a question about Pascal GPUs. I'm a student and cannot afford the expensive Pascal P100 GPU (even though I know it has powerful double-precision capabilities), so do GeForce-series GPUs (such as the GTX 1080 or GTX Titan X) support the Unified Memory features? If so, that would be great for fast simulation prototyping, even in single precision. Another question: I currently write my simulation code in Python. Do you know of any Python packages that can use the Unified Memory features? Thanks!

anon15011306 · December 29, 2016, 7:34pm

Hi Liu. Thank you for your feedback. Any Pascal GPU supports the new Unified Memory features such as on-demand paging and GPU memory oversubscription, so you can definitely prototype your application on a GTX 1080 or GTX Titan X. Regarding Python, the two most popular packages are Numba http://numba.pydata.org/ and PyCUDA https://mathema.tician.de/s.... Numba does not support Unified Memory allocations, but there is a way to use Unified Memory in PyCUDA, see https://documen.tician.de/p....
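To make the oversubscription point concrete, here is a minimal CUDA sketch, not code from the original post. It assumes a Pascal or newer GPU on Linux and a host with enough RAM to back the allocation. The 1.5x factor, the `fill` kernel, and the launch configuration are arbitrary choices for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a grid-stride loop that touches every element,
// so pages are faulted in and migrated to the GPU on demand.
__global__ void fill(float *data, size_t n, float value) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) data[i] = value;
}

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) return 1;

    // Assumption: request 1.5x the physical GPU memory.
    size_t bytes = total_bytes + total_bytes / 2;
    size_t n = bytes / sizeof(float);

    float *data = nullptr;
    cudaError_t err = cudaMallocManaged(&data, bytes);
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    fill<<<1024, 256>>>(data, n, 1.0f);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(data);
        return 1;
    }

    // Reading on the host migrates the touched pages back to system memory.
    printf("allocated %zu bytes, first=%f last=%f\n", bytes, data[0], data[n - 1]);
    cudaFree(data);
    return 0;
}
```

On GPUs or platforms without on-demand paging, the allocation is expected to fail once it exceeds GPU memory, so the error checks matter here.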

anon56416734 · January 12, 2017, 8:48pm

Just to note: you can get access to a P100 server with NVLink enabled in the cloud for $5/hour. You could prototype on a GTX 1080 and then run a full-speed test on the cloud server.
Cloud link: https://power.jarvice.com (choose the Ubuntu Linux for Power8 with NVLink support).

anon9086193 · January 12, 2017, 9:31pm

Re Fig. 9 - hints and prefetching seem to harm P8+NVLink on small data sets on the left side of the graph...what's up?

anon15011306 · January 20, 2017, 1:20am

Thanks, good catch! This was most likely a measurement error or driver overhead. I reran the numbers with a more recent driver and updated the performance charts. The issue for small datasets has disappeared, and the prefetching approach is now always better than the baseline. Moreover, the throughput of the largest test case increased significantly due to driver improvements related to page fault handling.

anon45371894 · May 26, 2017, 9:41am

Hello, is there any limit on the number of prefetch calls I can issue concurrently? I am seeing a performance slowdown after 64 calls. I am using cudaMemPrefetchAsync().
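For readers who want to reproduce this kind of measurement, the following is a rough sketch of the pattern described, not the poster's actual code. The chunk count, chunk size, and one-stream-per-prefetch layout are all assumptions. It also assumes the `cudaMemPrefetchAsync(ptr, bytes, device, stream)` signature from CUDA 8 through 12; newer toolkits changed this API.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const int num_chunks = 128;                  // hypothetical: above the 64 calls mentioned
    const size_t chunk_bytes = size_t(2) << 20;  // hypothetical: 2 MiB per prefetch
    const size_t total_bytes = num_chunks * chunk_bytes;

    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;

    char *buf = nullptr;
    if (cudaMallocManaged(&buf, total_bytes) != cudaSuccess) return 1;

    // First touch on the CPU so the pages start out in system memory.
    for (size_t i = 0; i < total_bytes; i += 4096) buf[i] = 1;

    // One stream per prefetch so the calls can be in flight concurrently.
    std::vector<cudaStream_t> streams(num_chunks);
    for (auto &s : streams) cudaStreamCreate(&s);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < num_chunks; ++i) {
        cudaError_t err = cudaMemPrefetchAsync(buf + i * chunk_bytes, chunk_bytes,
                                               device, streams[i]);
        if (err != cudaSuccess) {
            printf("prefetch %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
    }
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    printf("%d prefetches of %zu bytes took %.3f ms\n", num_chunks, chunk_bytes, ms);

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(buf);
    return 0;
}
```

Varying `num_chunks` around 64 while keeping `total_bytes` fixed would be one way to check whether the slowdown depends on the number of calls rather than the amount of data.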

anon15011306 · May 30, 2017, 2:14am

Hi Vishal, I have never tested so many concurrent prefetches. Can you put your example test code on gist or email me details at "first initial last name at nvidia dot com"? I'd like to look into this.

anon84587350 · October 3, 2017, 12:23pm

Hi Nikolay, I am not sure which approach to use in my program: UM or non-UM. In my application's pipeline, most of the components are GPU kernels or CUDA API calls, and the CPU does not access the data anywhere between pipeline stages. In that scenario, which type of memory should I use: UM, or device memory from cudaMalloc? I think device memory would be faster within the pipeline, but in a multi-GPU environment I would need to manage it with cudaSetDevice. I expect cudaMallocManaged to be simpler for multi-GPU, but I am not sure about its performance impact, as it will be limited by PCIe bandwidth.

Can you please share your opinion on this and suggest a better way of architecting a multi-GPU use case with minimal buffer access by the CPU?

513738452 · March 9, 2022, 7:01am

Does this mean that the GTX 1050, which uses the Pascal architecture, also supports the same functionality, i.e. GPU memory oversubscription?

nsakharnykh · March 11, 2022, 2:32pm

Yes, any Pascal+ GPU supports Unified Memory-based GPU memory oversubscription on Linux platforms. See this paragraph in the CUDA programming guide for detailed requirements: Programming Guide :: CUDA Toolkit Documentation
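As a quick way to check a particular card, here is a small sketch that queries the relevant device attributes through the CUDA runtime. It assumes that `cudaDevAttrConcurrentManagedAccess` is the property that indicates on-demand page migration, which is my reading of the programming guide; treat the interpretation in the comments as a guide rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        int managed = 0, concurrent = 0;
        // Nonzero if the device can allocate managed memory at all.
        cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
        // Nonzero if CPU and GPU can access managed memory concurrently;
        // assumed here to indicate on-demand paging and oversubscription.
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);

        printf("%d: %s (sm_%d%d) managedMemory=%d concurrentManagedAccess=%d\n",
               dev, prop.name, prop.major, prop.minor, managed, concurrent);
    }
    return 0;
}
```

A Pascal-class GeForce card on Linux would be expected to report compute capability 6.x with both attributes set to 1.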

513738452 · March 11, 2022, 2:45pm

Thanks, your reply saves me from spending money on a much more expensive GPU.
