Unified Memory in CUDA 6

Originally published at: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer…

As someone who works with CUDA Fortran, I am hoping the day comes soon when NVIDIA/PGI Fortran also includes similar functionality. I'd really like to get rid of all those freaking cudaMemcpy calls in my code!

We will be rolling out Unified Memory for additional languages and platforms in future releases of CUDA (and CUDA Fortran).

Great functionality, and definitely a move in the right direction for porting existing code rather than rewriting it. Can we expect virtual function table rewiring for true C++ object copying to the device? Any support for STL on the device (starting with vector and shared_ptr), even just read-only?

Very nice from CUDA 6. Really eager to get started with this. Just one question: does Unified Memory treat the CPU and GPU memories as one combined memory, or are they still seen as separate, with CUDA automatically doing all the copying instead of the programmer?

Eager to get started with this version. If I have a variable like `int *raw_ptr` with N×N elements, can I have another variable such as `int **ptrs` that points into the data of raw_ptr, i.e. `ptrs[0] = &raw_ptr[0]; ptrs[1] = &raw_ptr[N-1];`? Thanks a lot.

I've written a system for abstracting memory copies into my API, so the user can just use his data on the CPU and GPU seamlessly, using a checksum internally to determine if anything has changed and only transferring as late as necessary. Every part of the API is made more complex because of this. I'm really looking forward to just deleting all of that logic.

The problem with virtual function tables is that AFAIK the C++ standard does not specify the format/layout/implementation of vftables. This makes it nearly impossible to support calling virtual functions on shared objects across all C++ host compilers supported by CUDA / nvcc. As for STL, that is something that we intend to look at, but nothing I can share here yet.

On current hardware, the latter -- in a PC today the GPU and CPU memories are physically discrete. But the same programming model could be used for physically unified memory systems, such as Tegra SoCs.

It's all just memory, so yes. I didn't mention it in the above post, but there is also a `__managed__` declaration specifier, which allows you to declare global managed device pointers.
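Here is a minimal sketch of what that can look like (the names `sum`, `raw_ptr`, `ptrs`, and `kernel` are just made up for illustration): a `__managed__` global variable, plus a managed array of pointers aliasing into another managed allocation, built on the CPU and dereferenced on the GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__managed__ int sum;                    // statically declared managed variable

__global__ void kernel(int **ptrs) {
    sum = *ptrs[0] + *ptrs[1];          // dereference host-built pointers on the GPU
}

int main() {
    const int N = 8;
    int *raw_ptr, **ptrs;
    cudaMallocManaged(&raw_ptr, N * N * sizeof(int));
    cudaMallocManaged(&ptrs, 2 * sizeof(int *));

    raw_ptr[0] = 1;                     // initialized on the CPU
    raw_ptr[N * N - 1] = 2;
    ptrs[0] = &raw_ptr[0];              // pointers into the same managed allocation
    ptrs[1] = &raw_ptr[N * N - 1];

    kernel<<<1, 1>>>(ptrs);
    cudaDeviceSynchronize();            // required before touching managed data on the CPU

    printf("sum = %d\n", sum);          // prints 3
    cudaFree(ptrs);
    cudaFree(raw_ptr);
    return 0;
}
```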

Will CUDA Unified Memory be supported on GTX cards in Windows 7 and 8 or will it be limited to Tesla cards (due to requirement for TCC driver)? I am really looking forward to using Unified Memory in my CUDA applications, but do not want to limit my clients to using Tesla cards only.

Unified Memory will be supported on Compute Capability 3.0 and later (sm_30 - so K10 or GTX 680 or later), on 64-bit Linux, Windows 7 and Windows 8 in CUDA 6.0. Support for other operating systems will come at a later date.
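If you'd rather check support at run time than rely on the device list, a quick sketch (assuming the `managedMemory` field that `cudaDeviceProp` gained with CUDA 6) would be:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Unified Memory needs sm_30 or later and a driver/OS that supports it.
    bool supported = prop.managedMemory && prop.major >= 3;
    printf("%s (sm_%d%d): Unified Memory %s\n",
           prop.name, prop.major, prop.minor,
           supported ? "supported" : "not supported");
    return 0;
}
```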

Does Unified Memory support overlapping execution with data transfers on the default stream? Or do I still need to split the operations with cudaMemcpyAsync and put them in separate streams for overlapping?

You can always use cudaMemcpyAsync to explicitly copy data and overlap it with kernels in other streams. Unified Memory does not take away your ability to optimize.

In CUDA 6, pages from managed allocations that were touched on the CPU are migrated back to the GPU just before any kernel launch -- so there is no overlap with that kernel. However, you can still get overlap between multiple kernels in separate streams.

Also not discussed in this post is an API in CUDA 6 that allows you to attach a managed allocation to a specific stream, so that you can control which allocations are synchronized on which kernel launches, and increase concurrency.

Future CUDA releases will add more optimizations, such as prefetching.
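To illustrate the attach API mentioned above, here is a rough sketch (the `process` kernel and sizes are made up; `cudaStreamAttachMemAsync` with `cudaMemAttachSingle` is the CUDA 6 call in question):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Tell the runtime that each allocation is only touched by one stream.
    cudaStreamAttachMemAsync(s1, a, 0, cudaMemAttachSingle);
    cudaStreamAttachMemAsync(s2, b, 0, cudaMemAttachSingle);

    // A launch in s1 only synchronizes/migrates a, and a launch in s2 only b,
    // so the two kernels can proceed concurrently.
    process<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    process<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```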

What about constant memory? I would like to be able to allocate it, for example like this: int* pint = cudaConstMalloc();

And free it like that: cudaConstFree(pint);

Or by using: cudaConstMallocManaged();

It's a very nice article. A small note on the C++: this is a has-a class and not an is-a class... so there is no need for inheritance. ;-)

Unfortunately, due to the implementation of constant banks in the hardware, this is not possible at this time.

We want the class to satisfy "is a Managed class", so I believe inheritance of Managed is warranted in this case. If you disagree, can you provide an example of how this would work with a has-a implementation?
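For reference, the pattern in question looks roughly like this (a sketch, not the exact code from the post): a `Managed` base class that overrides `operator new`/`delete` to use `cudaMallocManaged`, and a `String` class that is-a `Managed` and has-a managed character buffer.

```cpp
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Anything that "is a Managed" object gets placed in managed memory
// simply by deriving from this class.
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

// "is-a" Managed, "has-a" character buffer; the buffer is also managed,
// so the whole object is usable from both host and device code.
class String : public Managed {
    char *data;
public:
    String(const char *s) {
        cudaMallocManaged(&data, strlen(s) + 1);
        strcpy(data, s);
    }
    ~String() { cudaFree(data); }
    __host__ __device__ const char *c_str() const { return data; }
};
```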

Oh, so deeply sorry about this; I was very absent-minded, but now I see it. Yes, of course it is has-a String and is-a Managed. Apologies, it was too late at night. :-)

And I must say you have demonstrated well how classes can be instantiated as CUDA 6.0 managed objects. Thanks for putting this example together. :-)