I’m working on a low-latency application where the amount of data to be copied back from device to host is determined at runtime by a kernel. The maximum size is known in advance, so I can pre-allocate a buffer on the host before launching the kernel.
To perform the copy, I’m currently using two cudaMemcpy calls:
The first retrieves the size.
The second copies the actual data using that size.
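The two-step approach might be sketched like this (the names `d_size`, `d_data`, `h_data`, and `max_bytes` are hypothetical placeholders, not from the original post):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t max_bytes = 1 << 20;  // maximum size, known in advance
    int  *d_size;                      // byte count, written by the kernel at runtime
    char *d_data, *h_data;
    cudaMalloc(&d_size, sizeof(int));
    cudaMalloc(&d_data, max_bytes);
    cudaMallocHost(&h_data, max_bytes);  // pre-allocated pinned host buffer

    // ... launch kernel that fills d_data and writes the byte count to d_size ...

    // Step 1: copy just the size (this small transfer is the extra latency).
    int h_size = 0;
    cudaMemcpy(&h_size, d_size, sizeof(int), cudaMemcpyDeviceToHost);

    // Step 2: copy only the valid payload using the size just retrieved.
    cudaMemcpy(h_data, d_data, h_size, cudaMemcpyDeviceToHost);

    cudaFree(d_size);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```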
Unfortunately, the first cudaMemcpy (just to read one integer) adds noticeable overhead to the overall runtime (see attached image).
To the best of my knowledge, there’s no version of cudaMemcpy whose size parameter (count) can reside on the device. I’ve considered using Unified Memory to avoid the extra copy, but in my case it ends up performing worse than the two-step approach.
Ideally, I’d like to avoid the first copy altogether. Has anyone run into a similar situation? Are there any tricks or best practices to reduce this latency without relying on Unified Memory?
From the picture, one can tell that your destination buffer is in pinned memory. Pinned memory can be accessed from within a kernel. A simple solution for your problem would be to use a copy kernel to copy the data from a device memory buffer into the destination pinned memory buffer.
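A minimal sketch of that idea, assuming a 64-bit system with unified virtual addressing (so memory from cudaMallocHost is directly addressable from device code); the kernel and variable names are hypothetical:

```cuda
// Grid-stride copy kernel: reads the size computed on the device and writes
// the payload straight into the pinned host buffer, with no host round trip.
__global__ void copy_to_host(const char* __restrict__ d_data,
                             const int*  __restrict__ d_size,
                             char*       __restrict__ h_dst)  // pinned host pointer
{
    int n = *d_size;  // runtime size, read directly on the device
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        h_dst[i] = d_data[i];
}

// Host side: a single launch replaces both cudaMemcpy calls.
// copy_to_host<<<blocks, threads, 0, stream>>>(d_data, d_size, h_data);
// cudaStreamSynchronize(stream);
```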
The buffer is indeed pinned and allocated with cudaMallocHost.
I wasn’t aware that kernels can perform memcpy-like operations directly. Do you happen to have a reference or example I could look at to learn more about how that works?
I was able to shave off a few microseconds by launching a lightweight kernel to write the size, then synchronizing and using cudaMemcpy with the correct amount of data.
I also experimented with doing everything inside a single large copy kernel. However, in cases where there’s more data to copy, the kernel-based strategy turns out to be noticeably slower, and the overhead ends up outweighing the gains.
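The hybrid variant described above could look roughly like this (names again hypothetical): a tiny kernel publishes the size into pinned host memory, so after synchronizing, the host can issue a single cudaMemcpy with the exact byte count.

```cuda
// Write the device-side size into a pinned host location the CPU can read.
__global__ void publish_size(const int* d_size, int* h_size /* pinned */)
{
    *h_size = *d_size;
}

// Host side:
// publish_size<<<1, 1, 0, stream>>>(d_size, h_size);
// cudaStreamSynchronize(stream);
// cudaMemcpy(h_data, d_data, *h_size, cudaMemcpyDeviceToHost);
```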
Normally, cudaMemcpy should not be much faster than a copy kernel; PCIe bandwidth should be the limiting factor, especially if it is the only kernel running at that time.