How can we efficiently perform a batch copy from CPU to GPU, initiated by the CPU or using an asynchronous approach?

I believe I need a method to efficiently copy a large and scattered dataset from CPU memory to GPU memory (or cache). Currently, there are two techniques available:

  1. Use the CPU to read and write the scattered data into a contiguous space, followed by `cudaMemcpy` to transfer it to GPU memory.
  • The drawback of this approach is the significant overhead of the CPU read and write operations.
  2. Utilize GPU zero-copy to transfer the data directly from CPU memory to GPU memory (cache).
  • The drawback of this approach is that the zero-copy access is not asynchronous and continuously occupies GPU threads. I would like to avoid excessive utilization of GPU threads during the copy process.
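For reference, a minimal sketch of approach (1), assuming a hypothetical row-based scatter pattern where `indices[i]` selects rows of a source array (all names here are illustrative, not from an existing API):

```cpp
// Sketch: CPU-side gather into a pinned staging buffer, then one async H2D copy.
// Assumes: src (pageable host array), indices (n row indices), row_elems floats per row.
float *staging = nullptr;
size_t row_bytes = row_elems * sizeof(float);
cudaHostAlloc(&staging, n * row_bytes, cudaHostAllocDefault);  // pinned memory

for (size_t i = 0; i < n; ++i)                                 // CPU gather step
    memcpy(staging + i * row_elems, src + indices[i] * row_elems, row_bytes);

// Single contiguous async transfer; overlaps with other work on `stream`
// because the source buffer is pinned.
cudaMemcpyAsync(d_buf, staging, n * row_bytes, cudaMemcpyHostToDevice, stream);
```

The pinned staging buffer is what makes `cudaMemcpyAsync` truly asynchronous with respect to the host; with pageable memory the copy would be staged internally and may block.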

So, what I would like to ask is whether there is a way for CUDA to support CPU-initiated zero copy, or for CUDA to use async memcpy without occupying GPU thread resources. In summary, I would like to inquire about CUDA’s support for efficient copying of large amounts of scattered data.

I don’t think you’re going to find anything more efficient than what you have already outlined.

If your usage pattern permits it, you could investigate async copies from global to shared that are available in Ampere and Hopper. I don’t know for certain that this methodology works from pinned memory but I think it should; it is the global space. I’d be very surprised if this methodology provided any positive benefit for the scattered case you describe, however.
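A minimal sketch of the async global-to-shared staging mentioned above, using the cooperative-groups `memcpy_async` API (which maps to the `cp.async` hardware path on Ampere and later); the kernel and its doubling operation are illustrative:

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void stage_and_process(const float *global_in, float *out, size_t n) {
    extern __shared__ float tile[];              // dynamic shared-memory tile
    auto block = cg::this_thread_block();

    // Collectively stage one tile from global memory into shared memory.
    // On Ampere+ this bypasses registers, so threads are not tied up
    // shuttling data during the copy.
    cg::memcpy_async(block, tile,
                     global_in + blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);
    cg::wait(block);                             // wait for the staged tile

    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;  // placeholder computation
}
```

If `global_in` is a device pointer obtained from mapped pinned host memory, the same pattern should apply, though as noted above I would not expect much benefit for a scattered access pattern.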

The GPU is a latency hiding machine. Therefore, a way to think about the cost of either of these approaches is whether or not you can do other meaningful/useful work (on the GPU) while these are unfolding. If so, the concerns you have raised may not be things that actually impact the performance of the code.

Using pinned host memory (part of the zero-copy methodology) makes the data accessible to GPU code. If you only need to access the data once, there is no need to copy it. And, again, the GPU generally can have a large number of threads in flight. Latency hiding means giving the GPU enough available work to do, so that the latency of other operations does not become performance-sensitive for the code.
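For completeness, the mapped pinned-memory setup looks like this (the kernel name and sizes are placeholders):

```cpp
// Sketch: allocate mapped, pinned host memory so GPU threads can read it
// directly over PCIe without an explicit cudaMemcpy.
float *h_data = nullptr, *d_alias = nullptr;
cudaHostAlloc(&h_data, bytes, cudaHostAllocMapped);   // pinned + mapped
cudaHostGetDevicePointer(&d_alias, h_data, 0);        // device-side alias

// ... CPU writes the scattered data into h_data ...

my_kernel<<<grid, block>>>(d_alias, n);  // kernel reads host memory in place
```

On systems with unified addressing the host pointer can often be passed to the kernel directly, but `cudaHostGetDevicePointer` is the portable form.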

Thank you! In my scenario, while MPS is running other stages in the pipeline, I want to avoid the zero copy process occupying too many GPU threads, preferably none at all. Therefore, the latency hiding technique I mentioned may not be suitable in my case. However, your suggestion of using asynchronous copy operations, specifically the first point, is worth considering.

After providing this additional requirement, do you have any further thoughts or ideas? I am currently searching for batch-copy APIs for scattered data. Thanks a lot!

Experimenting and profiling will be your friend here, as there is too little information to make remote predictions over the internet with any certainty.

In many cases, I would expect approach (1) to work the best, as system memory bandwidth is generally less constrained than PCIe bandwidth. Much will depend on the specifics of the spatial and temporal access patterns, the sizes of the scattered chunks, and the throughput of the system memory. Approach (1) should gain advantage with six-channel or eight-channel DDR4-3200 system memory typically found on high-end systems (or DDR5 if the system is very recent).

Your point is valid, but I still have a question: Method 1 incurs expensive CPU read/write overhead, while Method 2 eliminates this overhead. If I understand correctly, does this mean that Method 2 is always superior to Method 1 (especially for scattered data)?

I do not know what kind of system you are looking at. Modern workstation-class systems offer system memory bandwidth between 100 and 200 GB/sec, unidirectional PCIe bandwidth of 25 GB/sec, and GPU memory bandwidth around 800 to 1000 GB/sec. Also, system memory latency is typically quite low, around 60 ns to 80 ns. Given that, performing a gather operation on the host side does not seem scary to me and provides for the most efficient use of the “narrowest straw” in this particular data-movement scenario.

But beyond any conjecture you would want to experiment with, characterize, and profile any of the variants you are contemplating. I do not know what that entails in your specific context, but in my thinking this is something that can be investigated in the span of a couple of days, so that is what I would recommend.

You might also want to look into alternative, more compact and less scattered, storage arrangements on the host.

Thanks! I will conduct some experiments to validate your observations. By the way, in my scenario, the bottleneck lies in extracting node features from the sampled data for GNN training.