I believe I need a method to efficiently copy a large and scattered dataset from CPU memory to GPU memory (or cache). Currently, there are two techniques available:
1. Use the CPU to read the scattered data and write it into a contiguous buffer, then call cudaMemcpy to transfer that buffer to GPU memory.
The drawback of this approach is the significant overhead of the CPU-side read and write (gather) operations.
2. Use GPU zero copy so the GPU reads the data directly from CPU memory into GPU memory (cache).
The drawback of this approach is that the zero-copy access is not asynchronous and continuously occupies GPU threads; I would like to avoid tying up too many GPU threads during the copy.
So my question is: does CUDA support a CPU-initiated zero copy, or an asynchronous memcpy that does not occupy GPU thread resources? In summary, I would like to know what support CUDA offers for efficiently copying large amounts of scattered data.
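For concreteness, here is a minimal sketch of approach (1) as I understand it; the function name, chunk layout, and sizes are placeholders, not my real code:

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Sketch of approach (1): gather the scattered chunks into a pinned staging
// buffer on the CPU, then issue one contiguous async copy.
void gather_and_copy(const float* src, const size_t* offsets, size_t num_chunks,
                     size_t chunk_elems, float* d_dst, cudaStream_t stream)
{
    float* staging = nullptr;
    // Pinned memory lets cudaMemcpyAsync overlap with other work on the stream.
    cudaHostAlloc((void**)&staging, num_chunks * chunk_elems * sizeof(float),
                  cudaHostAllocDefault);

    // CPU-side gather: this is the read/write overhead I am worried about.
    for (size_t i = 0; i < num_chunks; ++i)
        std::memcpy(staging + i * chunk_elems, src + offsets[i],
                    chunk_elems * sizeof(float));

    // Single large contiguous transfer over PCIe.
    cudaMemcpyAsync(d_dst, staging, num_chunks * chunk_elems * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaFreeHost(staging);
}
```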
I don’t think you’re going to find anything more efficient than what you have already outlined.
If your usage pattern permits it, you could investigate the async copies from global to shared memory that are available on Ampere and Hopper. I don't know for certain that this methodology works from pinned host memory, but I think it should, since pinned memory is accessed through the global space. I'd be very surprised if this methodology provided any positive benefit for the scattered case you describe, however.
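For reference, a minimal sketch of such an async global-to-shared copy using cooperative groups; the tile size and data layout are made up for illustration:

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 1024;  // elements staged per block; purely illustrative

// Each block asynchronously stages one contiguous tile from global memory
// (or pinned host memory mapped into the global space) into shared memory.
__global__ void consume_tiles(const float* __restrict__ src, float* __restrict__ dst)
{
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // The copy is issued collectively; on Ampere/Hopper it maps to cp.async,
    // avoiding the usual load-into-register-then-store staging by each thread.
    cg::memcpy_async(block, tile, src + blockIdx.x * TILE, sizeof(float) * TILE);
    cg::wait(block);  // block until the staged tile is ready

    // Placeholder "useful work" on the staged tile.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        dst[blockIdx.x * TILE + i] = tile[i] * 2.0f;
}
```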
The GPU is a latency hiding machine. Therefore, a way to think about the cost of either of these approaches is whether or not you can do other meaningful/useful work (on the GPU) while these are unfolding. If so, the concerns you have raised may not be things that actually impact the performance of the code.
Using pinned host memory (part of the zero-copy methodology) makes the data accessible to GPU code. If you only need to access the data once, there is no need to copy it. And, again, the GPU can generally have a large number of threads in flight. Latency hiding means giving the GPU enough available work to do, so that the latency of other operations does not become performance-sensitive for the code.
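For example, a minimal zero-copy sketch (sizes and the kernel body are placeholders): the kernel reads pinned, mapped host memory directly, with no explicit copy step.

```cpp
#include <cuda_runtime.h>

// Each access to the host-resident buffer travels over the bus at read time.
__global__ void read_host_data(const float* host_visible, float* d_out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = host_visible[i];
}

int main()
{
    const size_t n = 1 << 20;
    float* h_data = nullptr;
    float* d_out  = nullptr;

    // cudaHostAllocMapped makes the pinned allocation visible to device code;
    // with unified virtual addressing the host pointer can be passed directly.
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaMalloc((void**)&d_out, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) h_data[i] = (float)i;

    read_host_data<<<(unsigned)((n + 255) / 256), 256>>>(h_data, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
    return 0;
}
```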
Thank you! In my scenario, MPS is running other stages of the pipeline at the same time, so I want to avoid the zero-copy process occupying too many GPU threads, preferably none at all. Therefore, the latency-hiding approach you mentioned may not be suitable in my case. However, your suggestion of using asynchronous copy operations, specifically the first point, is worth considering.
Given this additional requirement, do you have any further thoughts or ideas? I am currently searching for batch-copy APIs for scattered data. Thanks a lot!
Experimenting and profiling will be your friend here, as there is too little information to make remote predictions over the internet with any certainty.
In many cases, I would expect approach (1) to work the best, as system memory bandwidth is generally less constrained than PCIe bandwidth. Much will depend on the specifics of the spatial and temporal access patterns, the sizes of the scattered chunks, and the throughput of the system memory. Approach (1) should gain advantage with six-channel or eight-channel DDR4-3200 system memory typically found on high-end systems (or DDR5 if the system is very recent).
Your point is valid, but I still have a question: Method 1 incurs the expensive CPU read/write overhead, while Method 2 eliminates it. If I understand correctly, does that mean Method 2 is always superior to Method 1, especially for scattered data?
I do not know what kind of system you are looking at. Modern workstation-class systems offer system memory bandwidth between 100 and 200 GB/sec, unidirectional PCIe bandwidth of 25 GB/sec, and GPU memory bandwidth around 800 to 1000 GB/sec. Also, system memory latency is typically quite low, around 60 ns to 80 ns. Given that, performing a gather operation on the host side does not seem scary to me, and it makes the most efficient use of the “narrowest straw” in this particular data-movement scenario.
But beyond any conjecture, you would want to experiment with, characterize, and profile any of the variants you are contemplating. I do not know what that entails in your specific context, but in my thinking this is something that can be investigated in the span of a couple of days, so that is what I would recommend.
You might also want to look into alternative, more compact and less scattered, storage arrangements on the host.
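As a purely illustrative example (the structure and names are made up), a contiguous fixed-stride layout indexed by ID turns a scattered selection into a simple row gather:

```cpp
#include <cstring>
#include <vector>

// Illustrative only: all records live in one contiguous, fixed-stride array.
// A scattered selection then becomes a gather of whole rows, which is
// friendlier to CPU caches and feeds a single large host-to-device copy.
struct FeatureTable {
    std::vector<float> data;   // num_rows * row_len, row-major
    size_t row_len = 0;

    const float* row(size_t id) const { return data.data() + id * row_len; }
};

// Gather the selected rows into a contiguous (ideally pinned) staging buffer.
void gather_rows(const FeatureTable& table, const std::vector<size_t>& ids,
                 float* staging)
{
    for (size_t i = 0; i < ids.size(); ++i)
        std::memcpy(staging + i * table.row_len, table.row(ids[i]),
                    table.row_len * sizeof(float));
}
```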
Thanks! I will conduct some experiments to validate your observations. By the way, in my scenario, the bottleneck lies in extracting node features from the sampled data for GNN training.