Is cudaMemcpy() real-time safe?

I need to copy data to and from the GPU at some periodic interval for an audio application. The call to cudaMemcpy() will be inside the real-time thread, so it should be real-time safe. That is, it should not perform memory (de)allocations or other system calls, and it should have a consistent run-time.
Is this the case for cudaMemcpy()?
And what about cudaMemcpyAsync()?

I’m not sure what a real-time thread is. The operating systems I am familiar with for CUDA (Linux, Windows) are not generally thought of as RTOSes. (There is a supported method to use CUDA with a particular RTOS, but that doesn’t pertain to publicly/generally/freely available CUDA technology.) I don’t doubt that there are probably things you can do with Linux that give it some RT character, but I don’t think CUDA advertises being tested with any of those. For me, if you have to apply a “patch” of any sort, that might suggest you are in an untested area of CUDA usage.

I don’t think it is specified or documented whether cudaMemcpy does any system calls. That often means you should not rely on such a thing if there is no stated expectation. But a casual test of mine suggests that cudaMemcpy does not make any system calls, as reported by strace.

Not sure what that means. cudaMemcpy() is a blocking call (thread does not return from the library call until the copy operation is complete), and therefore I would expect its “run-time” to be strongly correlated to the size of the transfer. Different calls to cudaMemcpy(), with different transfer sizes, could have quite different “run-times”. Furthermore, cudaMemcpy in a particular direction (say, host to device) typically relies on the PCIE link to make such a transfer. If that link is busy with other transfers in the same direction, the latency of any cudaMemcpy call could be extended by the duration of those other, previous transfers as well. (We would probably be talking about cudaMemcpyAsync for the previous transfers, in this case.)
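That variability is easy to observe directly. The sketch below (illustrative code, not from the original post; the helper name is an assumption, and it requires a CUDA-capable system to run) times the same blocking cudaMemcpy call for two transfer sizes:

```cuda
// Sketch: measure host-side latency of blocking cudaMemcpy for two
// transfer sizes. Actual numbers depend entirely on the system.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

static double time_copy_us(float *dst, const float *src, size_t bytes) {
  auto t0 = std::chrono::steady_clock::now();
  cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);  // blocks until done
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main() {
  const size_t small = 4 * 1024, large = 4 * 1024 * 1024;
  float *h, *d;
  cudaMallocHost(&h, large);        // pinned host buffer
  cudaMalloc(&d, large);
  time_copy_us(d, h, small);        // warm-up (context creation, first touch)
  printf("4 KB : %8.2f us\n", time_copy_us(d, h, small));
  printf("4 MB : %8.2f us\n", time_copy_us(d, h, large));
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}
```

Running the 4 KB case in a loop and recording a histogram would show the spread of run-times, which is the quantity that matters for an audio deadline.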

Simple strace test case:

# cat t167.cu
#include <cstdio>

int main(){

  float *h, *d;
  h = new float[32768];
  cudaMalloc(&d, 32768*sizeof(float));
  printf("abcdefg\n");
  cudaMemcpy(d, h, 32768*sizeof(float), cudaMemcpyHostToDevice);
  printf("abcdefgh\n");
  return 0;
}
# nvcc -o t167 t167.cu
# strace ./t167
OUTPUT SNIPPED

write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xb8), 0x7fff6b329d10) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7fff6b328c60) = 0
ioctl(4, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7fff6b329b50) = 0
ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7fff6b3296c0) = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
write(1, "abcdefg\n", 8abcdefg
)                = 8
write(1, "abcdefgh\n", 9abcdefgh
)               = 9
exit_group(0)                           = ?
+++ exited with 0 +++

Thanks a lot for the reply.
It is needed for an audio plugin that runs under Windows. I know Windows does not have real-time threads, but the thread that processes the audio in an audio plugin is often called the real-time thread, because the processing of audio samples should complete within a consistent time. So things like memory allocations or system calls with unknown run-time are not allowed. The main goal is that it always (as far as possible) runs within a certain time.
I know no one can guarantee this. But one should do one's best to make that happen.

So by consistent run-time for cudaMemcpy I mean that the time the function takes to execute is more or less the same each time (using the same parameters, thus the same transfer direction and size). It should not run mostly in 20 µs but every now and then take 1 ms (e.g. because of a system call).
Good to know that it probably makes no system calls.

Can you say that of cudaMemcpyAsync too? Does it also make no system calls? Maybe its run-time is even more consistent, because it does not wait for the transfer to finish.

P.S. what is the OUTPUT SNIPPED above?

This suggests that your application can be characterized as having soft real-time requirements: an occasional missed deadline may degrade the quality of service but does not result in catastrophic failure. A typical example would be a dropped frame in a video application. If the frequency of missed deadlines increases, the application may be able to adapt, for example by reducing screen resolution in the case of the video application.

CUDA running on top of Windows and Linux is used in soft real-time applications. But these OS platforms and the standard CUDA software stack generally provide a “best effort” service without guaranteed deadlines. If you think about a memory copy operation just on the host system, its execution time is impacted by caches, TLBs, and DRAM controllers with just a single thread of execution. Now add additional threads, a second processor (GPU), a packetized interconnect (PCIe), and possible resource contention on the interconnect as well as both source and destination memory, and it is clear why the time to copy a particular sized chunk of memory can vary quite a bit. There is no clearly defined and guaranteed time limit as one would need for hard real-time operation.

I’ve given you enough of an example to explore system call behavior on Linux in the example I already provided. You can do a similar test yourself with cudaMemcpyAsync.
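For reference, here is a sketch of what such a test could look like, modeled on the t167 example; the file name, pinned allocation, and printf markers are illustrative assumptions:

```cuda
// Sketch of an strace probe for cudaMemcpyAsync, analogous to the
// blocking-copy test. Run under "strace ./a.out" on Linux and inspect
// the system calls (if any) between the two printf markers.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  float *h, *d;
  cudaMallocHost(&h, 32768 * sizeof(float));  // pinned, as async copies require
  cudaMalloc(&d, 32768 * sizeof(float));
  printf("abcdefg\n");                        // marker 1 in the strace output
  cudaMemcpyAsync(d, h, 32768 * sizeof(float), cudaMemcpyHostToDevice);
  printf("abcdefgh\n");                       // marker 2 in the strace output
  cudaDeviceSynchronize();                    // wait for the copy to finish
  return 0;
}
```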

It means I removed some of the output from the strace tool that I considered unimportant. The important stuff is what is between the two printf statements in the strace output I provided. If you want to see what I snipped out (it's quite long) you can run the test case yourself, on Linux.

I don’t have much experience with these kinds of explorations on windows, so can’t say much there.

And when I use cudaMemcpyAsync?
It will not wait for the transfer, so it doesn’t matter if other tasks need PCIe bandwidth.

It is necessary to distinguish between blocking behavior and deadlines for the copy operation. cudaMemcpyAsync is non-blocking for the host thread that invokes it. It basically deposits configuration data for a DMA mechanism into a queue and returns. That means the time spent in cudaMemcpyAsync is small, but it can be variable. An interesting question is whether it is possible to completely fill the finite-length queue, at which point cudaMemcpyAsync would block because it is waiting to insert into the queue. I know this can happen with kernel launches, which are also asynchronous to the host thread, so I guess the possibility exists at least theoretically for cudaMemcpyAsync.
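A sketch of how one might separate the enqueue cost from the completion time (illustrative code; the buffer size and names are assumptions, and the actual numbers depend entirely on the system):

```cuda
// Sketch: time how long cudaMemcpyAsync spends on the host (enqueue)
// versus how long the copy takes to actually complete.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 4096;                 // the 4 KB case discussed here
  float *h, *d;
  cudaMallocHost(&h, bytes);                 // pinned host buffer
  cudaMalloc(&d, bytes);
  cudaStream_t s;
  cudaStreamCreate(&s);

  auto t0 = std::chrono::steady_clock::now();
  cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
  auto t1 = std::chrono::steady_clock::now();  // returns once the work is queued
  cudaStreamSynchronize(s);
  auto t2 = std::chrono::steady_clock::now();  // copy has actually finished

  printf("enqueue : %.2f us\n",
         std::chrono::duration<double, std::micro>(t1 - t0).count());
  printf("complete: %.2f us\n",
         std::chrono::duration<double, std::micro>(t2 - t0).count());
  cudaStreamDestroy(s);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}
```

Repeating the enqueue measurement many times would reveal whether (and when) the call ever blocks on a full queue.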

The other component is whether the copy operation requested will have finished after a particular amount of elapsed time. The copy request is queued up, and one cannot generally say when the actual copy will start. Because of possible resource contention on the interconnect and on the memories at either end, the duration of the copy may vary, and one cannot fix a time at which the copy operation is guaranteed to have completed. If the code executing on the GPU has a dependency on the copied data, we can therefore not guarantee that this code will commence execution by a particular time.

Since cudaMemcpyAsync involves pinned (physically contiguous) memory on the host side, performance impact and variability from the virtual memory system of the OS (and its hardware underpinnings) are eliminated, but all other factors that can affect copy performance variability are still present.

I only have to copy 4 KB. So on a 16-lane PCIe 4 link it could be done in 128 ns, and I have 1 ms. So it probably will be done on time. But I realize that you cannot know whether the DMA engine has other stuff in the queue.
And I guess that cudaMemcpy() also uses DMA.

Maybe Unified Memory is a solution? I will create a new thread on this forum about this so this thread stays on its topic.

I would say that the likelihood of being able to deliver a 4KB block across PCIe with a 1 millisecond deadline is exceedingly high, near certainty. And this is why diverse applications with soft real-time requirements work well with CUDA acceleration.

Caution needs to be exercised when computing transfer times across PCIe. PCIe uses packetized transport at multiple levels of the protocol. At the bottom of the protocol stack, transfers typically are split into 256-byte chunks, resulting in a maximum efficiency of approximately 85% compared to theoretical bandwidth. At the transaction level, there is a fixed-size penalty due to protocol overhead in the PCIe controllers. For the older PCIe gen3 x16 interconnect, I gave the following timings in a previous post (note the fixed overhead of 1.125 microseconds per transfer):

A 16 KB transfer requires 1.25 microseconds to transmit, a 64KB transfer requires 5 microseconds to transmit, and a 256 KB transfer requires 20 microseconds to transmit. So in one second = 1 million microseconds, we achieve 1e6/(1.25+1.125) = 421000 transfers of 16 KB each for a total of 6.9e9 bytes, or 1e6/(5+1.125) = 163200 transfers of 64 KB for a total of 10.7e9 bytes, or 1e6/(20+1.125) = 47300 transfers of 256 KB each for a total of 12.4e9 bytes. As transfer size grows even further, the effective transfer rate will approach 13.1 GB/sec asymptotically.

I don’t have the equivalent measurements for PCIe gen 4 x16, but I think it is clear that a 4KB transfer likely requires on the order of one microsecond, which is still small compared to a one millisecond deadline.

One difference between cudaMemcpy and cudaMemcpyAsync that I did not mention previously (and that does not really apply here, because the transfer size is only 4 KB): DMA copying requires physically contiguous buffers. In the case of cudaMemcpy this is a fixed-size buffer allocated by the driver, which is a couple of MB in size (I have not checked on it recently). So larger transfers require piece-wise copies between system memory and this pinned buffer, plus DMA transfers between this buffer and the GPU. In the case of cudaMemcpyAsync the programmer selects the size of the pinned memory block, and the DMA controller pulls data from it (or delivers data to it) directly, which makes it more efficient.

Ok, thx a lot for this information.
Do you think using Unified Memory is a better solution for this? In that case the CPU writes the data directly to some buffer which the GPU can read, and vice versa. No DMA move needed.
The only problem may be that there are page-fault interrupts to be handled. But I made a new thread on this forum for this question.

Unified memory looks unified to the programmer. Physically there is still the DDR4 DRAM in the host system and the GDDR6X on the GPU. Data needs to flow across PCIe to get from one side to the other, and transfers will be limited in latency and throughput by this interconnect, regardless of how transfers are initiated and controlled. If we boil it down to basics, a DMA engine is just a set of address registers and a counter. The address registers increment / decrement as needed, the counter counts down to zero, and in this fashion a series of reads or writes of a particular length across PCIe is generated. To the PCIe interconnect they are like any other reads and writes.

I don’t think I am going out on much of a limb when I claim that use of unified memory cannot achieve better performance than well-managed use of cudaMemcpyAsync. Hard evidence to the contrary would cause me to revise this stance. What unified memory does is provide a higher level of abstraction to programmers, and that is often useful (the same principle resulted first in automatic transmissions and anti-lock brakes in automobiles and now CVT and regenerative braking).
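For completeness, a minimal sketch of what managed-memory use looks like (illustrative; the kernel and the prefetch call are my own example, not something from this thread): a single allocation is visible to both processors, and cudaMemPrefetchAsync can migrate pages ahead of time so the GPU does not have to take demand faults.

```cuda
// Minimal managed-memory sketch: one allocation usable from both CPU
// and GPU; the driver migrates pages between them on demand.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

int main() {
  const int n = 1024;                            // 4 KB of floats
  float *buf;
  cudaMallocManaged(&buf, n * sizeof(float));
  for (int i = 0; i < n; i++) buf[i] = 1.0f;     // CPU writes directly

  int dev;
  cudaGetDevice(&dev);
  cudaMemPrefetchAsync(buf, n * sizeof(float), dev);  // optional: avoid faults
  scale<<<(n + 255) / 256, 256>>>(buf, n);
  cudaDeviceSynchronize();                       // data is CPU-visible again

  printf("buf[0] = %f\n", buf[0]);
  cudaFree(buf);
  return 0;
}
```

Under the hood the migrations still travel across PCIe, which is the point made above: the abstraction changes who initiates the transfer, not the physical path.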

Well, I got the impression that Unified Memory exists only on the CPU or the GPU, not both. One of the advantages was that in total you use less memory (no memory on both sides).
So I hoped that when the memory is allocated on the GPU, for instance, a read or write by the CPU translates to a read or write over the PCIe bus. Thus no DMA: just a normal memory read or write, but over the PCIe bus to the GPU memory. That would make the reads and writes real-time safe because there are no system calls.
But then I also read something about page-fault interrupts to switch pages. That part I do not understand.