Request for Suggestions on Optimizing CPU-GPU Data Transfer

Hello NVIDIA Forum,

I am seeking optimization suggestions for my current project. I am working on optimizing the CPU-GPU data transfer time. My task involves uploading four 3D arrays from the CPU to the GPU for computation. On the GPU, I use 2D thread blocks to perform the computation, and then transfer the results back to the CPU. Currently, the data transfer time takes up the majority of the overall execution time.

I am considering the following optimization strategy: flattening each 3D array into a 1D array and merging the four arrays into one. I then plan to use three streams for asynchronous data transfer. On the GPU, I would use 2D thread blocks to index into the flattened 1D arrays. After computation, I would pack the results into a 1D array and transfer it back to the CPU.
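
To make the indexing concrete, here is a minimal sketch of what I mean by accessing a flattened 3D array with 2D thread blocks. The kernel body, names, and dimensions are placeholders, not my actual code:

```
// Sketch: 2D thread blocks indexing a flattened 3D array (x fastest).
// nx, ny, nz and the arrays are placeholders for my real data.
__global__ void compute(const float *in, float *out, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index
    if (i >= nx || j >= ny) return;
    for (int k = 0; k < nz; ++k) {                   // each thread walks z
        size_t idx = ((size_t)k * ny + j) * nx + i;  // 3D -> 1D offset
        out[idx] = in[idx] * 2.0f;                   // stand-in computation
    }
}
```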

Below is the nvprof time breakdown for each part of the process. Do you think this optimization approach is correct? Are there any better suggestions or alternatives?

Thank you!
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.89%  1.06467s    239976  4.4360us     671ns  15.487us  [CUDA memcpy HtoD]
                   16.09%  252.32ms     79992  3.1540us  3.0710us  3.9360us  [CUDA memcpy DtoH]
                   16.02%  251.30ms     19998  12.566us  11.071us  32.064us  flux_dxee_
      API calls:   93.06%  4.19244s    319968  13.102us  3.0600us  1.8282ms  cudaMemcpy
                    2.75%  123.72ms      9999  12.373us  2.1280us  31.608us  cudaDeviceSynchronize
                    2.52%  113.38ms     19998  5.6690us  4.7960us  534.75us  cudaLaunchKernel
                    1.65%  74.122ms         1  74.122ms  74.122ms  74.122ms  cuDevicePrimaryCtxRetain
                    0.02%  1.0937ms         1  1.0937ms  1.0937ms  1.0937ms  cuMemAllocHost
                    0.00%  174.40us        17  10.258us  1.9800us  119.12us  cuMemAlloc
                    0.00%  128.88us       120  1.0740us      92ns  49.904us  cuDeviceGetAttribute
                    0.00%  108.79us       412     264ns     105ns  7.5850us  cuGetProcAddress
                    0.00%  14.363us         1  14.363us  14.363us  14.363us  cuDeviceGetName
                    0.00%  7.7920us         2  3.8960us     205ns  7.5870us  cuCtxGetCurrent
                    0.00%  5.2400us         1  5.2400us  5.2400us  5.2400us  cuDeviceGetPCIBusId
                    0.00%  2.8640us         1  2.8640us  2.8640us  2.8640us  cudaGetDevice
                    0.00%  2.1140us         4     528ns     143ns  1.1560us  cuCtxSetCurrent
                    0.00%  1.7970us         1  1.7970us  1.7970us  1.7970us  cuInit
                    0.00%  1.5350us         4     383ns     124ns  1.0310us  cuDeviceGetCount
                    0.00%     699ns         4     174ns     107ns     239ns  cuDriverGetVersion
                    0.00%     685ns         1     685ns     685ns     685ns  cuDeviceComputeCapability
                    0.00%     675ns         3     225ns     135ns     322ns  cuDeviceGet
                    0.00%     460ns         1     460ns     460ns     460ns  cuDeviceTotalMem
                    0.00%     172ns         1     172ns     172ns     172ns  cuDeviceGetUuid
                    0.00%     153ns         1     153ns     153ns     153ns  cuModuleGetLoadingMode
 OpenACC (excl):  100.00%  21.609us         1  21.609us  21.609us  21.609us  acc_device_init
                    0.00%       0ns    159992       0ns       0ns       0ns  acc_delete
                    0.00%       0ns        16       0ns       0ns       0ns  acc_alloc
```

Host<->Device transfer time can be improved by using pinned host buffers. Pinning has side effects (additional allocation time), so for a single use it's not much of a win in my experience, but it certainly pays dividends with repeated use, and it will be important for a later comment of mine.
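
As a minimal sketch (error checking omitted; the size is a placeholder), pinned allocation differs from a plain malloc only in the allocation and free calls:

```
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;               // placeholder element count
    float *h_buf, *d_buf;

    // Page-locked (pinned) host allocation. The allocation itself costs
    // more than malloc, which is the side effect mentioned above.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    // Copies from pinned memory are faster, and pinned memory is also a
    // prerequisite for truly asynchronous cudaMemcpyAsync transfers.
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);                    // pinned memory has its own free
    return 0;
}
```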

Naturally, reducing transfers to the minimum necessary (whatever that means for your data organization) and issuing the fewest possible memcpy-type calls is usually a good thing.
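
For example, if your four equally sized arrays are packed back-to-back in one host buffer, the four uploads collapse into a single call. A sketch with placeholder names, to be adapted to your actual layout:

```
#include <cuda_runtime.h>

// Sketch: one transfer for four same-sized arrays packed contiguously.
// h_all and d_all each hold 4*n floats, back to back.
void uploadMerged(const float *h_all, float *d_all, size_t n)
{
    // One 4*n-element copy carries less per-call overhead than four
    // separate n-element copies.
    cudaMemcpy(d_all, h_all, 4 * n * sizeof(float), cudaMemcpyHostToDevice);
}

// Device-side, each array is a fixed offset into the merged buffer:
//   a = d_all, b = d_all + n, c = d_all + 2*n, d = d_all + 3*n
```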

Finally, a standard CUDA optimization technique is overlapping copy and compute. The general methodology is covered in unit 7 of this tutorial series. Pinned buffers are necessary for this overlap. The basic idea is to break the work into chunks and process one chunk at a time; done that way, the cost of copying data to and from the device can be hidden behind computation, potentially reducing the overall wall-clock time to produce the result.
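
A bare-bones version of that chunking pattern, assuming pinned buffers and using a placeholder kernel and sizes (not your actual code):

```
#include <cuda_runtime.h>

__global__ void work(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;        // stand-in for the real kernel
}

int main()
{
    const int nStreams = 3;
    const size_t chunk = 1 << 20;            // placeholder chunk size
    const size_t total = nStreams * chunk;

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  total * sizeof(float));  // pinned: required for
    cudaMallocHost(&h_out, total * sizeof(float));  // true async overlap
    cudaMalloc(&d_in,  total * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, computes, and copies it back;
    // copies in one stream can overlap compute in another.
    for (int s = 0; s < nStreams; ++s) {
        size_t off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + off,
                                                          d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```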

Yes, that may be a good idea, and it is consistent with my statements above.

Yes, if by that you mean the canonical overlap of copy and compute, that is generally a good idea.