Avoid retrieving the cudaMemcpy size from the GPU

Hello,

I’m working on a low-latency application where the amount of data to be copied back from device to host is determined at runtime by a kernel. The maximum size is known in advance, so I can pre-allocate a buffer on the host before launching the kernel.

To perform the copy, I’m currently using two cudaMemcpy calls:

  • The first retrieves the size.
  • The second copies the actual data using that size.
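The two steps above look roughly like this (a minimal sketch with hypothetical names; `d_size` and `d_data` are device pointers, `h_data` is the pre-allocated host buffer):

```cpp
int h_size = 0;

// First copy: fetch the element count the kernel produced.
cudaMemcpy(&h_size, d_size, sizeof(int), cudaMemcpyDeviceToHost);

// Second copy: transfer only the valid portion of the data.
cudaMemcpy(h_data, d_data, h_size * sizeof(int), cudaMemcpyDeviceToHost);
```

It is the first, tiny device-to-host transfer that dominates the latency here.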

Unfortunately, the first cudaMemcpy (just to read one integer) adds noticeable overhead to the overall runtime (see attached image).

To the best of my knowledge, there’s no version of cudaMemcpy that takes a size parameter (count) that resides on the device. I’ve considered using Unified Memory to avoid the extra copy, but in my case, that ends up performing worse than the two-step approach.

Ideally, I’d like to avoid the first copy altogether. Has anyone run into a similar situation? Are there any tricks or best practices to reduce this latency without relying on Unified Memory?

Thanks in advance!

From the picture, one can tell that your destination buffer is in pinned memory. Pinned memory can be accessed from within a kernel, so a simple solution to your problem would be to use a copy kernel that copies the data from a device memory buffer directly into the destination pinned memory buffer.

Thank you for the answer!

The buffer is indeed pinned and allocated with cudaMallocHost.

I wasn’t aware that kernels can perform memcpy-like operations directly. Do you happen to have a reference or example I could look at to learn more about how that works?

Thanks again!

There is nothing special about it. See below (written in browser)

__global__
void copykernel(const int* src, int* dst, const int* Nptr){
    // The element count lives in device memory, so the kernel reads it directly.
    const int N = *Nptr;
    // Grid-stride loop: copy N ints from device memory into the pinned host buffer.
    for(int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += blockDim.x * gridDim.x){
        dst[i] = src[i];
    }
}

copykernel<<<4096,256>>>(deviceptr, pinnedptr, Ndeviceptr);

Thank you!

I was able to shave off a few microseconds by launching a lightweight kernel to write the size, then synchronizing and using cudaMemcpy with the correct amount of data.
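In case it helps others, a sketch of that variant (hypothetical names; `pinned_size` is host memory allocated with cudaMallocHost, so a kernel can write to it directly):

```cpp
// A single thread writes the count into pinned host memory, replacing the
// extra device-to-host cudaMemcpy that previously fetched the size.
__global__ void write_size(const int* d_size, int* pinned_size){
    *pinned_size = *d_size;
}

// Host side:
write_size<<<1, 1>>>(d_size, pinned_size);
cudaDeviceSynchronize();  // ensure *pinned_size is valid on the host
cudaMemcpy(h_data, d_data, *pinned_size * sizeof(int), cudaMemcpyDeviceToHost);
```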

I also experimented with doing everything inside a single large copy kernel. However, for larger transfers the kernel-based strategy turns out to be noticeably slower than cudaMemcpy, so its overhead outweighs the gains.

Normally, cudaMemcpy should not be much faster than a copy kernel; PCIe bandwidth should be the limiting factor, especially if it is the only kernel running at that time.
