Unified Memory in CUDA 6

What about the GTX 650 Ti or GT 650M? They are also listed as Compute Capability 3.0 and the Kepler architecture.

Yes and Yes.
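If you want to check a particular card at runtime rather than looking it up, here is a minimal sketch (hypothetical example code, not from the post) using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Kepler and newer GPUs report compute capability 3.0 or higher.
        printf("Device %d: %s, compute capability %d.%d (%s)\n",
               d, prop.name, prop.major, prop.minor,
               prop.major >= 3 ? "Kepler or newer" : "pre-Kepler");
    }
    return 0;
}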

Hi Mark: Thanks for the exciting introduction to this important new feature. We are wondering if it is possible to pass the FPGA PCIe BAR address (we developed an FPGA PCIe board and GFDMA technology for DMA transfer between FPGA and GPU) to the GPU so that it can deep copy data from the FPGA to the GPU? Thank you!

VisionCtrl Technology Co., Ltd.

This is not something that is possible with our current GPU architecture. Stay tuned.

Is there any update about GPUDirect RDMA technology in CUDA 6? We are looking for a similar solution for the Windows OS. Thank you!

You can see details about what's new for GPUDirect RDMA in my SC13 talk on CUDA 6 (http://bit.ly/1du71fi). GPUDirect RDMA is not yet available on Windows.

Good night,

I wonder if there is a scheduled launch date yet for CUDA 6 for registered developers. I've seen that my graphics card is compatible (GT 640M).

(1) Yes, but pages are just the granularity at which dirtiness is tracked.

(2) Absolutely not: don't think of Unified Memory as a "page cache". You have access to the entire GPU memory (several GBs), not just a few pages in a cache!

(3) The default page size is the same as the OS page size today. In the future, we may expose control over this.

(4) PCI Express performance is unchanged by Unified Memory. Unified Memory is preferable to mapped host memory because when the data is in device memory (the default location for cudaMallocManaged), GPU threads access it at the same performance as any device memory. And when it's in host memory, CPU threads access it at the same performance as any host memory. What Unified Memory does is automatically copy only the pages that the CPU (GPU) touches back to the host (device). On current hardware, coherence is established only at kernel launch and device synchronization. I think one of the biggest benefits is the complex data structure / C++ data sharing this enables between host and device. Don't get hung up on pages.
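To make that last point concrete, here is a minimal sketch of the basic Unified Memory pattern (hypothetical example code, not from the post): allocate with cudaMallocManaged, launch a kernel, synchronize, then touch the results on the CPU.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data;
    // One pointer, valid on both host and device. Data lives in device
    // memory by default.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;      // CPU touch migrates pages to host

    increment<<<(n + 255) / 256, 256>>>(data, n); // launch migrates touched pages to device

    // Coherence is established at kernel launch and device synchronization,
    // so synchronize before the CPU reads the results.
    cudaDeviceSynchronize();

    printf("data[0] = %d\n", data[0]);            // only pages the CPU touches migrate back
    cudaFree(data);
    return 0;
}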

We have been DMA'ing data directly into the GPU's memory since the Fermi devices. These are P2P transfers. What am I missing?

The CUDA 6 Release Candidate is available now! https://developer.nvidia.co...

Not sure I understand the question, but your question implies you haven't read the blog. :) If so, you are missing a lot.

@visionctrl wants to DMA data from the GPU's SDRAM via PCIe into an FPGA, and you mentioned that this is not possible.

My point was that we have been doing the reverse of this since the Fermi days with CUDA 4.0 and UVA. We push data directly into the GPU's SDRAM. Peer-to-Peer gives this ability, actually in both directions.

The key is latency. By moving data directly into the GPU's SDRAM, processing it, and then displaying it, we can completely bypass the CPU's SDRAM.

Agreed, Unified Memory makes things simpler from a programming point of view when you need both CPU and GPU memory.

What @visionctrl wants to do should be a lot easier, eh? Just run a CUDA copy with the PCIe address of the FPGA's PCIe BAR. The DMA engine within the GPU doesn't know if this address is SDRAM or a FIFO implemented in an FPGA. It's just PCIe.

Same concept as P2P with two GPUs sharing data between each other. Except in this case, it's between a GPU and an FPGA...
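For reference, the two-GPU case looks roughly like this (a minimal sketch with UVA and peer access; error checking omitted, and device IDs 0 and 1 are just assumptions):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);        // can device 0 access device 1?

    float *d0 = 0, *d1 = 0;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // map device 1 into device 0's address space

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // With peer access enabled, this copy goes directly over PCIe
    // between the two devices, bypassing CPU SDRAM.
    cudaMemcpyPeer(d0, 0, d1, 1, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}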

Thoughts?

I believe the revolution of Unified Memory will promote the unification of CPU and GPU in the future! As the author says, I was completely shocked when I saw this unification!

You say: 'Don't think about unified memory as a page cache on the device.'
I think I have to disagree here. This is exactly what you should think
if you're referring to how things work under the hood. Otherwise I would
be very surprised. Let me explain my point. The virtual address space available
to the CPU is much bigger than the physical memory on the GPU. (Let's forget
for a moment that all address spaces are mapped into one UVA space.)
Let's make an example: Suppose we are talking about a K20 with 6 GB and a
total of 48 GB CPU memory. Let's further assume one manage-allocates 10 chunks
of 1 GB each. First question here: You say the default memory location is the
GPU. What happens when allocating the 7th chunk? Does the first chunk get
copied to CPU memory? Do we have an 'out of memory' error? Or is there really a
'first-touch allocation' mechanism at work?

Okay, let's say the allocation of 10 chunks was successful. Now, suppose the
user launches 10 kernels sequentially each using a different memory chunk:
first kernel uses chunk 1, second kernel uses chunk 2, etc. I understand
that unified memory leaves the memory on the device after a kernel launch.
Thus, before launching the 7th kernel we have 6 chunks used by the previous
kernels in GPU memory lying around. The 7th chunk cannot be copied by
the manager to the GPU due to insufficient available memory. There must
be some 'spilling algorithm' at work which decides which chunk to copy
to the CPU in order to free memory for the 7th chunk. LRU comes to mind.

Can you tell us whether there is a caching mechanism at work or whether
unified memory is limited to the GPU memory size?

On current hardware, Unified Memory does not allow oversubscribing GPU memory. You are limited on managed allocations to the amount of memory available on the GPU (smallest memory if there are multiple GPUs). Future hardware may support GPU page faulting which will enable us to oversubscribe GPU memory, and to do what you describe. In your example, I believe the allocations should start failing after you exceed available GPU memory. Today, pages are not migrated out of GPU memory unless they are touched by the CPU.
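For example, with 1 GB chunks on a 6 GB card, the allocations themselves start failing once GPU memory is exhausted. A minimal sketch (hypothetical, just checking the return value of cudaMallocManaged):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunkBytes = 1ULL << 30;   // 1 GB per chunk
    char *chunks[10];

    for (int i = 0; i < 10; ++i) {
        cudaError_t err = cudaMallocManaged(&chunks[i], chunkBytes);
        if (err != cudaSuccess) {
            // On current hardware, managed allocations cannot oversubscribe GPU memory.
            printf("chunk %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
        printf("chunk %d allocated\n", i);
    }
    return 0;
}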

Our application framework allows the user to access a given data
portion from both the CPU and the GPU. In order to provide a high
performance in either case we employ different data layouts depending
on the processor type that makes the access, e.g. AoS vs. SoA. Thus,
we change the data layout on the fly when migrating data between GPU
and CPU memory domains.
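For example, by different layouts I mean something like this (a rough sketch, not our actual code):

// Array of Structures (AoS): convenient for per-element access on the CPU.
struct ParticleAoS {
    float x, y, z;
    float mass;
};

// Structure of Arrays (SoA): gives coalesced loads when consecutive GPU
// threads access the same field of consecutive elements.
struct ParticlesSoA {
    float *x;
    float *y;
    float *z;
    float *mass;
};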

Now, since unified memory does the data migration for you, I guess its
job is done by just copying the data. Thus I assume it's not possible
to manage-allocate a chunk of memory and pass a user-defined data
layout transformation function to the malloc call. I am talking about
an optional software hook that would get called when data is migrated
by the driver/manager into a 'staging area', and from there it could be
processed and stored into its final destination by a user-defined
function.

Such a feature would be nice to have.

Unified Memory migration is at the page level. It would be very difficult to generically handle user-defined memory transformations like you describe at that level. I don't know of any CPU allocators that apply transformations, for example. If it requires explicit memcopies anyway, then Unified Memory doesn't gain you much.

As I pointed out in the article, it's going to be difficult for an automatic system like this to outperform custom-tuned memory management frameworks like you describe -- the programmer usually has more information than the runtime or compiler. Since you already have a framework, there is no reason you can't keep using it.

Yeah, but it's still only supported from Kepler onwards.

Is "elem->name" a typo?

dataElem is defined as:

struct dataElem {
    int prop1;
    int prop2;
    char *text;
};

You should use elem->text instead.

Good catch Coiby! I'll change the "char *text" in dataElem to "char *name".