Unified Memory in CUDA 6

What about the GTX 650 Ti or GT 650M? They are also listed as Compute Capability 3.0 and the Kepler architecture.

Yes and Yes.
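If you want to check a particular card at runtime rather than looking it up, here is a minimal sketch (hypothetical example code, not from the post) using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Kepler and newer GPUs report compute capability 3.0 or higher.
        printf("Device %d: %s, compute capability %d.%d (%s)\n",
               d, prop.name, prop.major, prop.minor,
               prop.major >= 3 ? "Kepler or newer" : "pre-Kepler");
    }
    return 0;
}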

Hi Mark: Thanks for the exciting introduction to this important new feature. We are wondering if it is possible to pass the FPGA PCIe BAR address (we developed an FPGA PCIe board and GFDMA technology for DMA transfer between FPGA and GPU) to the GPU so that it can deep copy data from the FPGA to the GPU? Thank you!

VisionCtrl Technology Co., Ltd.

This is not something that is possible with our current GPU architecture. Stay tuned.

Is there any update about GPUDirect RDMA technology in CUDA 6? We are looking for a similar solution for the Windows OS. Thank you!

You can see details about what's new for GPUDirect RDMA in my SC13 talk on CUDA 6 (http://bit.ly/1du71fi). GPUDirect RDMA is not yet available on Windows.

Good night,

I wonder if there is a scheduled launch date yet for CUDA 6 for registered developers. I've seen that my graphics card is compatible (GT 640M).

(1) Yes, but pages are just the granularity at which dirtiness is tracked.

(2) Absolutely not: don't think of Unified Memory as a "page cache". You have access to the entire GPU memory (several GBs), not just a few pages in a cache!

(3) The default page size is the same as the OS page size today. In the future, we may expose control over this.

(4) PCI Express performance is unchanged by Unified Memory. Unified Memory is preferable to mapped host memory because when the data is in device memory (the default location for cudaMallocManaged), GPU threads access it at the same performance as any device memory. And when it's in host memory, CPU threads access it at the same performance as any host memory. What Unified Memory does is automatically copy only the pages that the CPU (GPU) touches back to the host (device). On current hardware, coherence is established only at kernel launch and device synchronization. I think one of the biggest benefits is the complex data structure / C++ data sharing this enables between host and device. Don't get hung up on pages.
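To make that last point concrete, here is a minimal sketch of the basic Unified Memory pattern (hypothetical example code, not from the post): allocate with cudaMallocManaged, launch a kernel, synchronize, then touch the results on the CPU.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data;
    // One pointer, valid on both host and device. Data lives in device
    // memory by default.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;      // CPU touch migrates pages to host

    increment<<<(n + 255) / 256, 256>>>(data, n); // launch migrates touched pages to device

    // Coherence is established at kernel launch and device synchronization,
    // so synchronize before the CPU reads the results.
    cudaDeviceSynchronize();

    printf("data[0] = %d\n", data[0]);            // only pages the CPU touches migrate back
    cudaFree(data);
    return 0;
}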

We have been DMA'ing data directly into the GPU's memory since the Fermi devices. These are P2P transfers. What am I missing?

The CUDA 6 Release Candidate is available now! https://developer.nvidia.co...

Not sure I understand the question, but your question implies you haven't read the blog. :) If so, you are missing a lot.

@visionctrl wants to DMA data from the GPU's SDRAM via PCIe into an FPGA, and you mentioned that this is not possible.

My point was that we have been doing the reverse of this since the Fermi days with CUDA 4.0 and UVA. We push data directly into the GPU's SDRAM. Peer-to-Peer gives this ability, actually in both directions.

The key is latency. By moving data directly into the GPU's SDRAM, processing it, and then displaying it, we can completely bypass the CPU's SDRAM.

Agreed, Unified Memory makes things simpler from a programming point of view when you need both CPU and GPU memory.

What @visionctrl wants to do should be a lot easier, eh? Just run a CUDA copy with the PCIe address of the FPGA's PCIe BAR. The DMA engine within the GPU doesn't know if this address is SDRAM or a FIFO implemented in an FPGA. It's just PCIe.

Same concept as P2P with two GPUs sharing data between each other. Except in this case, it's between a GPU and an FPGA...
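For reference, the two-GPU case looks roughly like this (a minimal sketch with UVA and peer access; error checking omitted, and device IDs 0 and 1 are just assumptions):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);        // can device 0 access device 1?

    float *d0 = 0, *d1 = 0;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // map device 1 into device 0's address space

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // With peer access enabled, this copy goes directly over PCIe
    // between the two devices, bypassing CPU SDRAM.
    cudaMemcpyPeer(d0, 0, d1, 1, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}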

Thoughts?

I believe the revolution of Unified Memory will promote the unification of CPU and GPU in the future! As the author says, I was completely shocked when I saw this unification!

You say: 'Don't think about unified memory as a page cache on the device.'
I think I have to disagree here. This is exactly what you should think
if you're referring to how things work under the hood. Otherwise I would
be very surprised. Let me explain my point. The virtual address space available
to the CPU is much bigger than the physical memory on the GPU. (Let's forget
for a moment that all address spaces are mapped into one UVA space.)
Let's make an example: Suppose we are talking about a K20 with 6 GB and a
total of 48 GB CPU memory. Let's further assume one manage-allocates 10 chunks
of 1 GB each. First question here: You say the default memory location is the
GPU. What happens when allocating the 7th chunk? Does the first chunk get
copied to CPU memory? Do we have an 'out of memory' error? Or is there really a
'first-touch allocation' mechanism at work?

Okay, let's say the allocation of 10 chunks was successful. Now, suppose the
user launches 10 kernels sequentially each using a different memory chunk:
first kernel uses chunk 1, second kernel uses chunk 2, etc. I understand
that unified memory leaves the memory on the device after a kernel launch.
Thus, before launching the 7th kernel we have 6 chunks used by the previous
kernels in GPU memory lying around. The 7th chunk cannot be copied by
the manager to the GPU due to insufficient available memory. There must
be some 'spilling algorithm' at work which decides which chunk to copy
to the CPU in order to free memory for the 7th chunk. LRU comes to mind.

Can you tell us whether there is a caching mechanism at work or whether
unified memory is limited to the GPU memory size?

On current hardware, Unified Memory does not allow oversubscribing GPU memory. You are limited on managed allocations to the amount of memory available on the GPU (smallest memory if there are multiple GPUs). Future hardware may support GPU page faulting which will enable us to oversubscribe GPU memory, and to do what you describe. In your example, I believe the allocations should start failing after you exceed available GPU memory. Today, pages are not migrated out of GPU memory unless they are touched by the CPU.
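For example, with 1 GB chunks on a 6 GB card, the allocations themselves start failing once GPU memory is exhausted. A minimal sketch (hypothetical, just checking the return value of cudaMallocManaged):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunkBytes = 1ULL << 30;   // 1 GB per chunk
    char *chunks[10];

    for (int i = 0; i < 10; ++i) {
        cudaError_t err = cudaMallocManaged(&chunks[i], chunkBytes);
        if (err != cudaSuccess) {
            // On current hardware, managed allocations cannot oversubscribe GPU memory.
            printf("chunk %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
        printf("chunk %d allocated\n", i);
    }
    return 0;
}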

Our application framework allows the user to access a given data
portion from both the CPU and the GPU. In order to provide a high
performance in either case we employ different data layouts depending
on the processor type that makes the access, e.g. AoS vs. SoA. Thus,
we change the data layout on the fly when migrating data between GPU
and CPU memory domains.
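For example, by different layouts I mean something like this (a rough sketch, not our actual code):

// Array of Structures (AoS): convenient for per-element access on the CPU.
struct ParticleAoS {
    float x, y, z;
    float mass;
};

// Structure of Arrays (SoA): gives coalesced loads when consecutive GPU
// threads access the same field of consecutive elements.
struct ParticlesSoA {
    float *x;
    float *y;
    float *z;
    float *mass;
};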

Now, since unified memory does the data migration for you, I guess its
job is done by just copying the data. Thus I assume it's not possible
to manage-allocate a chunk of memory and pass a user-defined data
layout transformation function to the malloc call. I am talking about
an optional software hook that would get called when data is migrated
by the driver/manager into a 'staging area', and from there it could be
processed and stored into its final destination by a user-defined
function.

Such a feature would be nice to have.

Unified Memory migration is at the page level. It would be very difficult to generically handle user-defined memory transformations like you describe at that level. I don't know of any CPU allocators that apply transformations, for example. If it requires explicit memcopies anyway, then Unified Memory doesn't gain you much.

As I pointed out in the article, it's going to be difficult for an automatic system like this to outperform custom-tuned memory management frameworks like you describe -- the programmer usually has more information than the runtime or compiler. Since you already have a framework, there is no reason you can't keep using it.

Yeah, but it's still only supported from Kepler onwards.

Is "elem->name" a typo?

dataElem is defined as:

struct dataElem {
    int prop1;
    int prop2;
    char *text;
};

You should use elem->text instead.

Good catch Coiby! I'll change the "char *text" in dataElem to "char *name".