Can I use Unified Memory in a soft real-time system?

Hi,
I need to move data to and from the GPU for an audio application. I understand this is a soft real-time system, so I want this copy to take a predictable (maximum) amount of time.
Is that possible?
I have read that in a Unified Memory system a page fault interrupt is used to migrate pages. Is this done in a predictable amount of time? Or is there a risk that all kinds of unknown system calls (like memory allocation) get involved?

Is there a way to use the Unified Memory model such that the read and writes are suitable for a soft real-time system?

Windows has a substantially different unified memory behavior compared to Linux. A simplistic description of UM on Windows is that all UM allocations are copied from host to device at the start of a kernel launch. There is no magic here; there is no reason to think it’s much different than the requisite number of cudaMemcpy operations immediately preceding each kernel launch.
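
As a rough sketch (the buffer size and kernel below are made up for illustration), the managed-memory variant behaves on Windows roughly like the explicit-copy variant that follows it:

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // Variant A: Unified Memory. On Windows, the runtime copies the managed
    // allocation to the device as part of the kernel launch.
    float *um = nullptr;
    cudaMallocManaged(&um, n * sizeof(float));
    for (int i = 0; i < n; i++) um[i] = 1.0f;      // host writes
    scale<<<(n + 255) / 256, 256>>>(um, n);         // implicit host->device movement here
    cudaDeviceSynchronize();                        // after this, host may read um[] again

    // Variant B: explicit copies, roughly what that amounts to under the hood.
    float *h = (float *)malloc(n * sizeof(float));
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(um); cudaFree(d); free(h);
    return 0;
}
```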

But isn’t the idea of UM that you can read and write the memory at all times? Thus also during kernel execution?

And how can the system know which memory to copy? When I define several buffers in UM for different kernels and then start one of those kernels, how does the system know which UM buffers belong to it?

The idea of UM is that I can use a single pointer to refer to data, whether I am in host code or device code.

Whether or not you can “read and write the memory at all times” varies by platform/OS. Maxwell or Kepler GPUs, or any GPU on Windows, will report the concurrent managed access property as false, which means that while a kernel is running (or after a kernel launch but prior to a cudaDeviceSynchronize()), UM allocations are inaccessible to host code.
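
You can query that property at runtime; a minimal sketch (assuming device 0):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, concurrent = 0;
    // 0 means: don't touch managed allocations from the host while a kernel
    // is running (typical for Maxwell/Kepler, or any GPU under Windows WDDM).
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    printf("concurrentManagedAccess: %d\n", concurrent);
    return 0;
}
```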

A pinned allocation is something that can be accessed by either host code or device code, “all the time”.

And a pinned (UM) allocation is not copied at the start of the kernel as you described before?

How does pinned UM memory work? Is each access transferred over the PCIe bus?

pinned memory is separate from UM. It is not automatically copied at the kernel launch point.

pinned memory is described in a variety of places, including unit 7 of this online training series.

Yes, one feature of pinned memory is that it is directly accessible from device code, and in that case the accesses trigger transfers across the PCIE bus.
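
A minimal sketch of that (the names and sizes are made up): allocate pinned, mapped host memory and read it directly from a kernel; every device-side access travels over the PCIE bus, and no cudaMemcpy is issued.

```cpp
#include <cuda_runtime.h>

__global__ void sum(const float *in, float *out, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += in[i];   // direct reads of pinned host memory
    *out = s;
}

int main() {
    const int n = 1024;
    float *h_in = nullptr, *h_out = nullptr;

    // cudaHostAllocMapped makes the pinned allocation accessible from device code.
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, sizeof(float),     cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    // With unified addressing the same pointer value works in device code;
    // cudaHostGetDevicePointer is the portable way to obtain it.
    float *d_in = nullptr, *d_out = nullptr;
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    sum<<<1, 1>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    // h_out[0] now holds the result; no cudaMemcpy was needed.

    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```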

OK, thanks for clearing this up. So for our (soft) real-time application we should definitely use pinned memory.

Will it also be possible (using pinned memory) to transfer some data to the GPU while the kernels are executing?

Suppose we want very low latency, and after starting the kernels there is some user input that we want to affect the current calculations. We would like to be able to write this new input to memory and move it to the GPU so that the GPU can use it.

Is there a way to do that?
Thus, can we copy data to the GPU while the kernel is executing, and will the kernel be able to detect that? E.g. reading some flag using volatile pointers so that it skips the device cache? Or using atomics in global memory that both GPU and CPU can access?

Generally speaking it is possible to move data between host and device to/from a running kernel. It can be more challenging in a Windows WDDM setting (as compared to Linux) but I’ve heard anecdotally that the Windows GPU Hardware scheduling feature impacts this (in a positive way).

if you want to copy data to a running kernel, the general suggestion I would have is to use zero-copy methods, which is like saying “use pinned memory”, and directly access the data (copy it) using device/kernel code. This is a typical methodology that has existed in CUDA since its inception.
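
A hedged sketch of the flag-polling idea you described (the names, payload and fence choice are hypothetical; whether the host-side writes become visible while the kernel is running still depends on platform/OS, so test carefully):

```cpp
#include <atomic>
#include <cuda_runtime.h>

__global__ void worker(volatile int *flag, volatile float *param, float *result) {
    // ... normal processing could go here ...
    while (*flag == 0) { }            // spin until the host publishes new input
    *result = *param * 2.0f;          // use the value the host just wrote
}

int main() {
    int *flag = nullptr;
    float *param = nullptr, *result = nullptr;
    cudaHostAlloc(&flag,   sizeof(int),   cudaHostAllocMapped);
    cudaHostAlloc(&param,  sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&result, sizeof(float), cudaHostAllocMapped);
    *flag = 0;

    cudaStream_t s;
    cudaStreamCreate(&s);
    // With unified addressing, the pinned host pointers can be passed to the kernel directly.
    worker<<<1, 1, 0, s>>>(flag, param, result);

    // Later, e.g. in response to user input, publish new data to the running kernel:
    *param = 42.0f;                                        // payload first...
    std::atomic_thread_fence(std::memory_order_seq_cst);   // ...ordered before the flag
    *flag = 1;

    cudaStreamSynchronize(s);   // kernel saw the flag and finished; *result is valid
    cudaFreeHost(flag); cudaFreeHost(param); cudaFreeHost(result);
    cudaStreamDestroy(s);
    return 0;
}
```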

I assume you are pointing to the Mapped Memory feature?

What about cache in this case?
If we map a CPU memory region into GPU space, how can we be sure that the values we write with the CPU are seen by the GPU and vice versa?

I think you could ask the same questions about cudaMemcpy, or even the two cases you had in mind (reading a flag through a volatile pointer, or using atomics in global memory).

I’m not sure what is different between the case where you populate pinned memory and then launch a kernel, and the case where you launch a kernel and then populate pinned memory. If you fundamentally believe that the pinned memory mechanism allows data to stay in a CPU cache and never make it to the target, then neither case is reliable. I won’t be able to explain it in much detail beyond the links I’ve given already. I won’t be able to give a detailed description of CPU cache behavior.

If you believe something doesn’t work or isn’t reliable then you probably shouldn’t use it. And of course, carefully testing any implementation you settle on is certainly a good idea.

It’s not that I believe things don’t work; I’m just trying to understand all this.

According to your answers I concluded that I should use Pinned Memory.
And according to your remark about using zero-copy methods, I assume we should use Mapped Memory. Zero-copy means the use of Mapped Memory, right?

I searched for information about that and I learned that caching may be a problem. That is why I asked if that would be a problem. And how to deal with that.

I read a paper where they did special things (which I do not yet understand) to solve the cache issue on the GPU side, but that paper is 10 years old.
I also read someone saying that the GPU cache is disabled for mapped memory. If that is the case, there is no cache issue on the GPU side.

I think there is still a potential cache problem for the CPU side. We need to be sure writes do reach the memory and that reads will not be cached.

Well, it could be that caches are flushed at the start of a kernel, for instance, and maybe again at the end of a kernel. I just want to know if what we want is possible and could work correctly.

when you pin memory, in any modern setting, it is automatically “mapped”. That means it is accessible, using a single numerical pointer value, from either host or device code. So in my mind, pinning memory implies mapping.

zero-copy means taking advantage of this in GPU kernel code to directly access this memory and make the data it contains available in device code, without (necessarily) any use of cudaMemcpy-type activity. Thus the moniker “zero-copy”. Of course, the direct access mechanism from device code still involves movement of data across the PCIE bus.

I’m not aware of any caching issues on the CPU side. On the GPU side, it should be sufficient to mark the pointer volatile, which avoids the L1 cache. AFAIK the L2 is not used for access to the “sysmem” space, which is that portion of the logical global space that results in bus cycles to access data resident in host memory. (It seems that volatile has a similar meaning in C++, although I don’t know if I have ever used that.)
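
For example, a kernel reading mapped host memory might declare the pointer like this (the names are made up):

```cpp
// Each read of host_data[] issues a fresh access to the sysmem space instead
// of re-using a value cached in L1.
__global__ void poll_input(volatile const float *host_data, float *out) {
    *out = host_data[0] + host_data[1];
}
```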

I will not be able to explain CPU caching activity around this or most other use-cases for CUDA. I’ve never run into it as a concern. I don’t know of specifications for CPU cache behavior on CUDA function calls, kernel calls, or other CUDA activity.

I don’t know if what you want is possible and could work correctly.

Thanks for all the information. This is really valuable!
I know enough now to do some testing.