Beyond GPU Memory Limits with Unified Memory on Pascal

Originally published at:

Figure 1: Dimethyl ether jet simulations designed to study complex new fuels. Image courtesy of the Center for Exascale Simulation of Combustion in Turbulence (ExaCT).

Modern computer architectures have a hierarchy of memories of varying size and performance. GPU architectures are approaching a terabyte per second memory bandwidth that, coupled with high-throughput computational cores, creates…

Can I take advantage of the Page Migration Engine in Pascal GPUs when using OpenCL? I think it would drastically simplify some parts of the code generator for my compiler, but it generates only OpenCL code.

Hi Troels. Currently the Page Migration Engine functionality is not supported in our OpenCL implementation.

That's a shame. Is there any hope of support being added in the future?

Sorry, at this time we do not have plans to add support for it.

Hi Nikolay. The Unified Memory techniques you applied to the simulation are amazing; thanks for the introduction. I have a question about Pascal GPUs. I'm a student and cannot afford the expensive Pascal P100 GPU (even though I know it has powerful double-precision capabilities), so do GeForce-series GPUs (such as the GTX 1080 or GTX Titan X) support the Unified Memory features? If so, that would be great for fast simulation prototyping, even in single precision. Another question: I currently write my simulation code in Python. Do you know of any Python packages that can use the Unified Memory features? Thanks!

Hi Liu. Thank you for your feedback. Any Pascal GPU supports the new Unified Memory features such as on-demand paging and GPU memory oversubscription, so you can definitely prototype your application on a GTX 1080 or GTX Titan X. Regarding Python, the two most popular packages are Numba and PyCUDA. Numba does not support Unified Memory allocations, but there is a way to use Unified Memory in PyCUDA, see

Just to note: you can get access to a P100 server with NVLink enabled in the cloud for $5/hour. You could prototype on a GTX 1080 and then run a full-speed test on the cloud server.
Cloud link: (choose the Ubuntu Linux for Power8 with NVLink support).
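
For readers prototyping on a GeForce card, here is a minimal sketch of the on-demand paging behavior mentioned in the reply above. The helper name fill_managed and the launch configuration are illustrative assumptions, not code from the article. The buffer is allocated with cudaMallocManaged, written by a kernel, and then read on the CPU with no explicit copies. On a Pascal or newer GPU under Linux the same pattern should also work when the buffer is larger than GPU memory, although that case is not exercised here.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Grid-stride kernel: each thread writes a strided subset of the buffer.
__global__ void fill(float *data, size_t n, float value) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = value;
}

// Hypothetical helper: allocates n floats of managed memory, fills them on
// the GPU, and verifies the first and last elements on the CPU.
// Returns true on success, false on any CUDA error or mismatch.
bool fill_managed(size_t n) {
    if (n == 0) return false;
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return false;

    // Pages migrate to the GPU on demand as the kernel touches them.
    fill<<<256, 256>>>(data, n, 1.0f);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    // CPU reads go through the same pointer; no cudaMemcpy is needed.
    ok = ok && data[0] == 1.0f && data[n - 1] == 1.0f;
    cudaFree(data);
    return ok;
}
```

Calling fill_managed(1 << 20) from a host main() exercises the path with a 4 MiB buffer; increasing n beyond the GPU's memory size is how one would try oversubscription on a supported platform.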

Re Fig. 9 - hints and prefetching seem to harm P8+NVLink on small data sets on the left side of the graph...what's up?

Thanks, good catch! This was most likely a measurement error or driver overhead. I reran the numbers with a more recent driver and updated the performance charts. The issue for small datasets has disappeared and the prefetching approach is now always better than the baseline. Moreover, the throughput of the largest test case increased significantly due to driver improvements related to page fault handling.

Hello, is there any limit on the number of prefetch calls I can issue concurrently? I am seeing a performance slowdown after 64 calls. I am using cudaMemPrefetchAsync().

Hi Vishal, I have never tested so many concurrent prefetches. Can you put your example test code on gist or email me details at "first initial last name at nvidia dot com"? I'd like to look into this.
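
For context, a minimal sketch of what issuing many prefetches can look like. This is not Vishal's code; the helper name prefetch_in_chunks and the chunking scheme are assumptions for illustration. It uses the four-argument cudaMemPrefetchAsync(ptr, bytes, device, stream) form found in CUDA 8 through CUDA 12 (newer toolkits changed this signature), and it expects a device that supports concurrent managed access, otherwise the prefetch call returns an error.

```cuda
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: prefetches a managed buffer to the current GPU in
// fixed-size chunks on one stream.
// Returns the number of prefetch calls issued, or -1 on any CUDA error.
int prefetch_in_chunks(size_t bytes, size_t chunk_bytes) {
    if (bytes == 0 || chunk_bytes == 0) return -1;
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;

    char *data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return -1;
    memset(data, 0, bytes);  // first touch on the CPU, so pages start in host memory

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(data); return -1; }

    int calls = 0;
    bool ok = true;
    for (size_t off = 0; off < bytes && ok; off += chunk_bytes) {
        size_t len = (bytes - off < chunk_bytes) ? bytes - off : chunk_bytes;
        ok = cudaMemPrefetchAsync(data + off, len, device, stream) == cudaSuccess;
        if (ok) ++calls;
    }
    // Wait for all queued prefetches before releasing the buffer.
    ok = (cudaStreamSynchronize(stream) == cudaSuccess) && ok;

    cudaStreamDestroy(stream);
    cudaFree(data);
    return ok ? calls : -1;
}
```

Timing this helper for different chunk counts (for example with cudaEvent timers around the loop) is one way to produce a small reproducer for the slowdown described above.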

Hi Nikolay, I am not sure which approach to use in my program: Unified Memory (UM) or non-UM. Most components in my application's pipeline are GPU kernels or CUDA API calls, and the CPU does not access the data between pipeline stages. In that scenario, which type of memory should I use: UM, or device memory allocated with cudaMalloc? I think device memory would be faster within my pipeline, but in a multi-GPU environment I would need to manage it using cudaSetDevice. I expect cudaMallocManaged to be simpler for multi-GPU, but I am not sure about its performance impact, as it will be limited by PCIe bandwidth.

Can you please share your opinion on this and suggest a better way of architecting a multi-GPU use case with minimal CPU access to the buffers?
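
To make the managed-memory option in this question concrete, here is a rough sketch of one possible pattern, not a recommendation from the author. A single cudaMallocManaged buffer is split into one slice per GPU, and each slice is pinned to its GPU with cudaMemAdvise and prefetched there before the kernel runs, so the kernels do not have to fault pages in over the interconnect. The helper name scale_on_all_gpus and the even split are assumptions. The cudaMemAdvise and cudaMemPrefetchAsync calls use the CUDA 8 through CUDA 12 signatures and need Pascal or newer GPUs with concurrent managed access.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float factor) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical helper: doubles n managed floats, one slice per visible GPU.
// Returns true if every CUDA call succeeded and the results check out.
bool scale_on_all_gpus(size_t n) {
    int ndev = 0;
    if (n == 0 || cudaGetDeviceCount(&ndev) != cudaSuccess || ndev < 1) return false;

    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return false;
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // one-time CPU initialization

    size_t per_gpu = (n + ndev - 1) / ndev;  // slice length, rounded up
    bool ok = true;
    for (int d = 0; d < ndev && ok; ++d) {
        size_t off = (size_t)d * per_gpu;
        if (off >= n) break;
        size_t len = (n - off < per_gpu) ? n - off : per_gpu;
        size_t len_bytes = len * sizeof(float);
        ok = cudaSetDevice(d) == cudaSuccess
          // Hint that this slice should live in GPU d's memory.
          && cudaMemAdvise(data + off, len_bytes, cudaMemAdviseSetPreferredLocation, d) == cudaSuccess
          // Move the slice up front instead of faulting it in page by page.
          && cudaMemPrefetchAsync(data + off, len_bytes, d, 0) == cudaSuccess;
        if (ok) scale<<<256, 256>>>(data + off, len, 2.0f);
    }
    // Wait for every GPU before the CPU reads the results.
    for (int d = 0; d < ndev; ++d)
        ok = (cudaSetDevice(d) == cudaSuccess) && (cudaDeviceSynchronize() == cudaSuccess) && ok;

    ok = ok && data[0] == 2.0f && data[n - 1] == 2.0f;
    cudaFree(data);
    return ok;
}
```

With this layout the CPU touches the buffer only at initialization and at the final check. Whether it matches explicit cudaMalloc buffers in speed depends on the workload and would need to be measured.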

Does this mean that the GTX 1050, which uses the Pascal architecture, also supports the same functionality, i.e. GPU memory oversubscription?

Yes, any Pascal+ GPU supports Unified Memory-based GPU memory oversubscription on Linux platforms. See this paragraph in the CUDA programming guide for detailed requirements: Programming Guide :: CUDA Toolkit Documentation
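
A quick way to check a particular card and platform is to query the device attributes at run time. The sketch below reads cudaDevAttrConcurrentManagedAccess, which exists in the CUDA runtime since CUDA 8. Treating that attribute as a proxy for on-demand paging and oversubscription support is an assumption made here for illustration, and the helper name concurrent_managed_access is hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports concurrent managed
// access, 0 if it does not, and -1 if the query itself fails.
int concurrent_managed_access(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) return -1;
    return value != 0 ? 1 : 0;
}
```

A result of 0 on a Pascal card usually points at the platform (for example a non-Linux driver model) rather than the GPU itself, which is consistent with the Linux requirement stated above.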

Thanks, your reply saves me from spending money on a much more expensive GPU.