Request for Suggestions on Optimizing CPU-GPU Data Transfer

Hello NVIDIA Forum,

I am seeking optimization suggestions for my current project. I am working on optimizing the CPU-GPU data transfer time. My task involves uploading four 3D arrays from the CPU to the GPU for computation. On the GPU, I use 2D thread blocks to perform the computation, and then transfer the results back to the CPU. Currently, the data transfer time takes up the majority of the overall execution time.

I am considering the following optimization strategy: flattening each 3D array into a 1D array and merging the four arrays into one. I then plan to use three streams for asynchronous data transfer. On the GPU, I would use 2D thread blocks to index into the flattened 1D arrays. After computation, I would pack the results into a 1D array and transfer it back to the CPU.
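
To make the indexing concrete, here is a minimal sketch of what I mean by accessing a flattened 3D array with 2D thread blocks. The kernel body, names, and dimensions are placeholders, not my actual code:

```
// Sketch: 2D thread blocks indexing a flattened 3D array (x fastest).
// nx, ny, nz and the arrays are placeholders for my real data.
__global__ void compute(const float *in, float *out, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index
    if (i >= nx || j >= ny) return;
    for (int k = 0; k < nz; ++k) {                   // each thread walks z
        size_t idx = ((size_t)k * ny + j) * nx + i;  // 3D -> 1D offset
        out[idx] = in[idx] * 2.0f;                   // stand-in computation
    }
}
```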

Below is the nvprof time breakdown for each part of the process. Do you think this optimization approach is correct? Are there any better suggestions or alternatives?

Thank you!
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.89%  1.06467s    239976  4.4360us     671ns  15.487us  [CUDA memcpy HtoD]
                   16.09%  252.32ms     79992  3.1540us  3.0710us  3.9360us  [CUDA memcpy DtoH]
                   16.02%  251.30ms     19998  12.566us  11.071us  32.064us  flux_dxee_
      API calls:   93.06%  4.19244s    319968  13.102us  3.0600us  1.8282ms  cudaMemcpy
                    2.75%  123.72ms      9999  12.373us  2.1280us  31.608us  cudaDeviceSynchronize
                    2.52%  113.38ms     19998  5.6690us  4.7960us  534.75us  cudaLaunchKernel
                    1.65%  74.122ms         1  74.122ms  74.122ms  74.122ms  cuDevicePrimaryCtxRetain
                    0.02%  1.0937ms         1  1.0937ms  1.0937ms  1.0937ms  cuMemAllocHost
                    0.00%  174.40us        17  10.258us  1.9800us  119.12us  cuMemAlloc
                    0.00%  128.88us       120  1.0740us      92ns  49.904us  cuDeviceGetAttribute
                    0.00%  108.79us       412     264ns     105ns  7.5850us  cuGetProcAddress
                    0.00%  14.363us         1  14.363us  14.363us  14.363us  cuDeviceGetName
                    0.00%  7.7920us         2  3.8960us     205ns  7.5870us  cuCtxGetCurrent
                    0.00%  5.2400us         1  5.2400us  5.2400us  5.2400us  cuDeviceGetPCIBusId
                    0.00%  2.8640us         1  2.8640us  2.8640us  2.8640us  cudaGetDevice
                    0.00%  2.1140us         4     528ns     143ns  1.1560us  cuCtxSetCurrent
                    0.00%  1.7970us         1  1.7970us  1.7970us  1.7970us  cuInit
                    0.00%  1.5350us         4     383ns     124ns  1.0310us  cuDeviceGetCount
                    0.00%     699ns         4     174ns     107ns     239ns  cuDriverGetVersion
                    0.00%     685ns         1     685ns     685ns     685ns  cuDeviceComputeCapability
                    0.00%     675ns         3     225ns     135ns     322ns  cuDeviceGet
                    0.00%     460ns         1     460ns     460ns     460ns  cuDeviceTotalMem
                    0.00%     172ns         1     172ns     172ns     172ns  cuDeviceGetUuid
                    0.00%     153ns         1     153ns     153ns     153ns  cuModuleGetLoadingMode
 OpenACC (excl):  100.00%  21.609us         1  21.609us  21.609us  21.609us  acc_device_init
                    0.00%       0ns    159992       0ns       0ns       0ns  acc_delete
                    0.00%       0ns        16       0ns       0ns       0ns  acc_alloc
```

Host<->Device transfer time can be improved by using pinned host buffers. Pinning has side effects (additional allocation time), so for a single use it's not much of a win in my experience, but it certainly pays dividends with repeated use, and it will be important for a later comment of mine.
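
As a minimal sketch (error checking omitted; the size is a placeholder), pinned allocation differs from a plain malloc only in the allocation and free calls:

```
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;               // placeholder element count
    float *h_buf, *d_buf;

    // Page-locked (pinned) host allocation. The allocation itself costs
    // more than malloc, which is the side effect mentioned above.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    // Copies from pinned memory are faster, and pinned memory is also a
    // prerequisite for truly asynchronous cudaMemcpyAsync transfers.
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);                    // pinned memory has its own free
    return 0;
}
```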

Naturally, reducing transfers to the minimum necessary (whatever that means for your data organization) and issuing the fewest possible memcpy-type calls is usually a good thing.
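
For example, if your four equally sized arrays are packed back-to-back in one host buffer, the four uploads collapse into a single call. A sketch with placeholder names, to be adapted to your actual layout:

```
#include <cuda_runtime.h>

// Sketch: one transfer for four same-sized arrays packed contiguously.
// h_all and d_all each hold 4*n floats, back to back.
void uploadMerged(const float *h_all, float *d_all, size_t n)
{
    // One 4*n-element copy carries less per-call overhead than four
    // separate n-element copies.
    cudaMemcpy(d_all, h_all, 4 * n * sizeof(float), cudaMemcpyHostToDevice);
}

// Device-side, each array is a fixed offset into the merged buffer:
//   a = d_all, b = d_all + n, c = d_all + 2*n, d = d_all + 3*n
```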

Finally, a standard CUDA optimization technique is overlapping copy and compute. The general methodology is covered in unit 7 of this tutorial series. Pinned buffers are necessary for this overlap. The basic idea is to break the work into chunks and process one chunk at a time; done that way, the cost of copying data to and from the device can be hidden behind computation, potentially reducing the overall wall-clock time to produce the result.
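
A bare-bones version of that chunking pattern, assuming pinned buffers and using a placeholder kernel and sizes (not your actual code):

```
#include <cuda_runtime.h>

__global__ void work(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;        // stand-in for the real kernel
}

int main()
{
    const int nStreams = 3;
    const size_t chunk = 1 << 20;            // placeholder chunk size
    const size_t total = nStreams * chunk;

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  total * sizeof(float));  // pinned: required for
    cudaMallocHost(&h_out, total * sizeof(float));  // true async overlap
    cudaMalloc(&d_in,  total * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, computes, and copies it back;
    // copies in one stream can overlap compute in another.
    for (int s = 0; s < nStreams; ++s) {
        size_t off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + off,
                                                          d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```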

Yes, that may be a good idea, and it is consistent with my statements above.

Yes, if by that you mean the canonical overlap of copy and compute, that is generally a good idea.