Small random memcpy (device to device) on GPU

Hi, I need to do many small random memcpy operations (512 B each) on a K80 GPU (device to device). I used cudaMemcpy with cudaMemcpyDeviceToDevice, but it takes a long time. Is there a faster memcpy for small random accesses on the GPU?

Could you provide more context?

i.e. at what point must the memory copy be completed?
Why must it be device to device?
Can the memory copy only occur within/during/after the kernel?
What exactly do you mean by random access?
What is the end goal?

Hi,
Basically, I have two sets of GPU memory buffers. I would like to copy data from the first buffer to the second buffer and then run the kernel on the second buffer. Here, "random" means that two consecutive 512 B memory copies may not be contiguous. I need the fastest way to copy small amounts of data within the GPU.

@De_Arup,

It’s not clear, but I’m assuming you mean that you want to copy 512 contiguous bytes and are finding cudaMemcpyD2D too slow or too high in latency.

Since you’re only copying a small amount of data I would create a single-warp “copy kernel” and launch it immediately after the first kernel is launched.

The kernel execution configuration would be:

copy_512_kernel<<<1,32>>>(dst, src);

The kernel would look like:

__global__ void
copy_512_kernel(uint4* const dst, const uint4* const src)
{
  // 32 threads each copy one 16-byte uint4: 32 x 16 B = 512 B total
  dst[threadIdx.x] = src[threadIdx.x];
}
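
For completeness, a minimal usage sketch (the consumer kernel and stream names are placeholders, not from the post): issuing the copy kernel and the kernel that reads the destination on the same stream guarantees the copy finishes before the consumer runs, with no host synchronization.

// Sketch: stream ordering makes the copy complete before the consumer
// kernel reads dst. Names other than copy_512_kernel are assumptions.
void enqueue_copy_then_compute(uint4* dst, const uint4* src,
                               cudaStream_t stream)
{
    copy_512_kernel<<<1, 32, 0, stream>>>(dst, src);  // 32 x 16 B = 512 B
    // consumer_kernel<<<grid, block, 0, stream>>>(dst);  // assumed kernel
}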

I use a similar kernel to copy variable amounts of data (snapshots of device-side atomics) back to the host. I’m only copying 4-16 bytes of data and just want it to copy as soon as the kernel is complete.
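
In case it helps, a hedged sketch of such a variable-length variant (the count parameter n is my addition): guarding on the element count lets a single warp copy anywhere from one uint4 up to 32 of them.

// Variable-length variant (sketch): copies n uint4 elements (16 B each),
// 0 < n <= 32, with a single warp. The n parameter is an assumption.
__global__ void
copy_small_kernel(uint4* const dst, const uint4* const src, const int n)
{
  if (threadIdx.x < n)
    dst[threadIdx.x] = src[threadIdx.x];
}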

Runtime is < 2 usecs.

The cudaMemcpy() function might have special cases for small copies, but I doubt it can be faster (or lower-latency) than your own kernel.
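
Since the original question involves many non-contiguous 512 B blocks, the single-warp idea extends naturally to one warp (one thread block) per chunk. A sketch, assuming the chunk offsets (in 16-byte uint4 units) already live in device arrays:

// Batched variant (sketch): thread block i copies 512 B from
// src + src_off[i] to dst + dst_off[i]; offsets are in uint4 units.
// Launch as copy_512_batched<<<num_chunks, 32>>>(dst, src, dst_off, src_off);
__global__ void
copy_512_batched(uint4* const dst, const uint4* const src,
                 const int* const dst_off, const int* const src_off)
{
  dst[dst_off[blockIdx.x] + threadIdx.x] =
      src[src_off[blockIdx.x] + threadIdx.x];
}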

It may be useful to take a step back and ponder whether a memory copy is in fact needed. It has been my experience with software performance optimization efforts (going back long before GPUs existed) that when asked “what is a fast way to copy memory?” the immediate counter question should be “why would you want to do that?”. Similarly for explicit matrix inversions or transpositions.

Moving data around accomplishes no real work, but costs time and energy. It is best avoided. Is there a zero-copy approach, maybe by passing a pointer? Can data movement be incorporated into other processing steps, e.g. by writing the output of a previous computational step to the desired locations?
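
As one concrete illustration of folding the movement into a prior step (all names here are hypothetical, not from the thread): if the producer kernel knows where the consumer expects its input, it can scatter its results directly to those locations, and the separate copy pass disappears.

// Sketch: the producer writes each result straight to the position the
// consumer will read (via dst_idx), so no separate D2D copy is needed.
__global__ void
producer_scatter_kernel(float* const out, const int* const dst_idx,
                        const float* const in, const int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[dst_idx[i]] = 2.0f * in[i];  // placeholder computation
}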

Yes, a ping-pong buffer design with two streams would be even better… if your kernel dependencies support it.
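
A rough sketch of that ping-pong arrangement (buffer, chunk, and kernel names are assumptions): while the consumer works on one buffer in one stream, the next chunk is staged into the other buffer on the second stream, so the copy overlaps the compute instead of serializing with it.

// Ping-pong sketch: even/odd chunks alternate between two buffers and two
// streams. Within a stream the copy precedes the (assumed) consumer kernel;
// across streams, copy and compute overlap.
void process_chunks(uint4* buf[2], const uint4* const* src_chunks,
                    int num_chunks, cudaStream_t stream[2])
{
    for (int c = 0; c < num_chunks; ++c) {
        const int p = c & 1;  // ping-pong index
        copy_512_kernel<<<1, 32, 0, stream[p]>>>(buf[p], src_chunks[c]);
        // consumer_kernel<<<grid, block, 0, stream[p]>>>(buf[p]);  // assumed
    }
    cudaDeviceSynchronize();  // drain both streams
}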

Even simpler, as @njuffa points out, passing in a different pointer is always going to be cheaper than a D2D copy.

The fastest copy is no copy at all.
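
In code, "no copy" is often just a pointer swap on the host (a trivial sketch; the roles of the two buffers are assumptions):

#include <utility>

// Sketch: exchange which device buffer is "input" and which is "output"
// between iterations; no bytes move on the device.
void swap_roles(uint4*& read_buf, uint4*& write_buf)
{
    std::swap(read_buf, write_buf);  // the old output becomes the new input
}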

With modern processor architectures, the throughput of computational cores has improved at dramatically higher rates than memory throughput (to say nothing of memory latency, which leveled off years ago). We are now almost at the point where computations that have long been considered classic examples of computation-limited code, such as GEMM, are starting to approach the memory throughput limit. In addition, the electrical energy expended on a single memory access is now about an order of magnitude higher than that of an FMA.

As a consequence, high-performance computing going forward will need to focus on minimizing data movement (at all levels: on-chip, off-chip, and between cluster nodes), even at the cost of additional computation (e.g. on-the-fly compression and decompression) or redundant computation (when multiple threads or nodes compute identical results to avoid having to communicate).