Unified Memory in CUDA 6

Originally published at: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer…

As someone who works with CUDA Fortran, I am hoping the day comes soon when NVIDIA/PGI Fortran also includes similar functionality. I'd really like to get rid of all those freaking cudaMemcpy calls in my code!

We will be rolling out Unified Memory for additional languages and platforms in future releases of CUDA (and CUDA Fortran).

Great functionality, and definitely a move in the right direction for porting existing code rather than rewriting it. Can we expect virtual function table rewiring for true C++ object copying to the device? Any support for STL on the device (starting with vector and shared_ptr), even just read-only?

Very nice from CUDA 6. Really eager to get started with this. Just one question: does Unified Memory treat the CPU and GPU memories as one combined memory, or are they still seen as separate, with CUDA automatically doing all the copying instead of the programmer?

Eager to get started with this version. If I have a variable like `int *raw_ptr` with N×N elements, can I have another variable such as `int **ptrs` that points into the data of raw_ptr, i.e. `ptrs[0] = &raw_ptr[0]; ptrs[1] = &raw_ptr[N-1];`? Thanks a lot.

I've written a system for abstracting memory copies into my API, so the user can just use his data on the CPU and GPU seamlessly, using a checksum internally to determine if anything has changed and only transferring as late as necessary. Every part of the API is made more complex because of this. I'm really looking forward to just deleting all of that logic.

The problem with virtual function tables is that AFAIK the C++ standard does not specify the format/layout/implementation of vftables. This makes it nearly impossible to support calling virtual functions on shared objects across all C++ host compilers supported by CUDA / nvcc. As for STL, that is something that we intend to look at, but nothing I can share here yet.

On current hardware, the latter -- in a PC today the GPU and CPU memories are physically discrete. But the same programming model could be used for physically unified memory systems, such as Tegra SoCs.

It's all just memory, so yes. I didn't mention it in the above post, but there is also a `__managed__` declaration specifier, which allows you to declare global managed device pointers.
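Here is a minimal sketch of what that can look like (the names `sum`, `raw_ptr`, `ptrs`, and `kernel` are just made up for illustration): a `__managed__` global variable, plus a managed array of pointers aliasing into another managed allocation, built on the CPU and dereferenced on the GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__managed__ int sum;                    // statically declared managed variable

__global__ void kernel(int **ptrs) {
    sum = *ptrs[0] + *ptrs[1];          // dereference host-built pointers on the GPU
}

int main() {
    const int N = 8;
    int *raw_ptr, **ptrs;
    cudaMallocManaged(&raw_ptr, N * N * sizeof(int));
    cudaMallocManaged(&ptrs, 2 * sizeof(int *));

    raw_ptr[0] = 1;                     // initialized on the CPU
    raw_ptr[N * N - 1] = 2;
    ptrs[0] = &raw_ptr[0];              // pointers into the same managed allocation
    ptrs[1] = &raw_ptr[N * N - 1];

    kernel<<<1, 1>>>(ptrs);
    cudaDeviceSynchronize();            // required before touching managed data on the CPU

    printf("sum = %d\n", sum);          // prints 3
    cudaFree(ptrs);
    cudaFree(raw_ptr);
    return 0;
}
```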

Will CUDA Unified Memory be supported on GTX cards in Windows 7 and 8 or will it be limited to Tesla cards (due to requirement for TCC driver)? I am really looking forward to using Unified Memory in my CUDA applications, but do not want to limit my clients to using Tesla cards only.

Unified Memory will be supported on Compute Capability 3.0 and later (sm_30 - so K10 or GTX 680 or later), on 64-bit Linux, Windows 7 and Windows 8 in CUDA 6.0. Support for other operating systems will come at a later date.
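If you'd rather check support at run time than rely on the device list, a quick sketch (assuming the `managedMemory` field that `cudaDeviceProp` gained with CUDA 6) would be:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Unified Memory needs sm_30 or later and a driver/OS that supports it.
    bool supported = prop.managedMemory && prop.major >= 3;
    printf("%s (sm_%d%d): Unified Memory %s\n",
           prop.name, prop.major, prop.minor,
           supported ? "supported" : "not supported");
    return 0;
}
```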

Does Unified Memory support overlapping execution with data transfers on the default stream? Or do I still need to split the operations with cudaMemcpyAsync and put them in separate streams for overlapping?

You can always use cudaMemcpyAsync to explicitly copy data and overlap it with kernels in other streams. Unified Memory does not take away your ability to optimize.

In CUDA 6, pages from managed allocations that were touched on the CPU are migrated back to the GPU just before any kernel launch -- so there is no overlap with that kernel. However, you can still get overlap between multiple kernels in separate streams.

Also not discussed in this post is an API in CUDA 6 that allows you to attach a managed allocation to a specific stream, so that you can control which allocations are synchronized on which kernel launches, and increase concurrency.

Future CUDA releases will add more optimizations, such as prefetching.
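To illustrate the attach API mentioned above, here is a rough sketch (the `process` kernel and sizes are made up; `cudaStreamAttachMemAsync` with `cudaMemAttachSingle` is the CUDA 6 call in question):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Tell the runtime that each allocation is only touched by one stream.
    cudaStreamAttachMemAsync(s1, a, 0, cudaMemAttachSingle);
    cudaStreamAttachMemAsync(s2, b, 0, cudaMemAttachSingle);

    // A launch in s1 only synchronizes/migrates a, and a launch in s2 only b,
    // so the two kernels can proceed concurrently.
    process<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    process<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```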

What about constant memory? I would like to be able to allocate it, for example like this: int* pint = cudaConstMalloc();

And free it like that: cudaConstFree(pint);

Or by using: cudaConstMallocManaged();

It's a very nice article. A small note on the C++: this is a has-a class and not an is-a class... so there is no need for inheritance. ;-)

Unfortunately, due to the implementation of constant banks in the hardware, this is not possible at this time.

We want the class to satisfy "is a Managed class", so I believe inheritance of Managed is warranted in this case. If you disagree, can you provide an example of how this would work with a has-a implementation?
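For reference, the pattern in question looks roughly like this (a sketch, not the exact code from the post): a `Managed` base class that overrides `operator new`/`delete` to use `cudaMallocManaged`, and a `String` class that is-a `Managed` and has-a managed character buffer.

```cpp
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Anything that "is a Managed" object gets placed in managed memory
// simply by deriving from this class.
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

// "is-a" Managed, "has-a" character buffer; the buffer is also managed,
// so the whole object is usable from both host and device code.
class String : public Managed {
    char *data;
public:
    String(const char *s) {
        cudaMallocManaged(&data, strlen(s) + 1);
        strcpy(data, s);
    }
    ~String() { cudaFree(data); }
    __host__ __device__ const char *c_str() const { return data; }
};
```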

Oh, so deeply sorry about this; I was very absent-minded, but now I see it. Yes, of course it is has-a String and is-a Managed. Apologies, it was too late at night. :-)

And I must say you have demonstrated well how classes can be instantiated as CUDA 6.0 managed objects. Thanks for putting this example together. :-)