Hello NVIDIA Forum,
I am looking for suggestions on reducing the CPU-GPU data transfer time in my current project. My task uploads four 3D arrays from the CPU to the GPU, runs the computation on the GPU using 2D thread blocks, and then transfers the results back to the CPU. Data transfer currently dominates the execution time: in the profile below, the two memcpy entries account for about 84% of the GPU activity time (67.89% HtoD plus 16.09% DtoH).
I am considering the following strategy: flatten each 3D array into a 1D array and merge the four arrays into a single contiguous buffer, then use three streams to transfer the data asynchronously. On the GPU, I would keep the 2D thread blocks and have them index into the 1D buffer. After the computation, I would write the results into 1D arrays and transfer them back to the CPU.
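To make the plan concrete, here is a minimal host-side sketch of the packing and chunking steps in plain C++, with no CUDA calls. The names `PackedFields`, `flat_index`, and `split_chunks`, the row-major layout, and the use of `double` are all assumptions for illustration; a column-major (Fortran-style) layout would swap the roles of `i` and `k` in the index formula.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: four nx*ny*nz fields stored back to back in one buffer,
// so a single transfer can move all four arrays.
struct PackedFields {
    std::size_t nx, ny, nz;
    std::vector<double> data;  // size is 4 * nx * ny * nz

    PackedFields(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(4 * nx_ * ny_ * nz_, 0.0) {}

    std::size_t field_size() const { return nx * ny * nz; }

    // Row-major index (i slowest, k fastest) of element (i, j, k) in field f.
    // Assumed layout; the same formula would be used inside the kernel.
    std::size_t flat_index(std::size_t f, std::size_t i,
                           std::size_t j, std::size_t k) const {
        return f * field_size() + (i * ny + j) * nz + k;
    }

    double& at(std::size_t f, std::size_t i, std::size_t j, std::size_t k) {
        return data[flat_index(f, i, j, k)];
    }
};

// One contiguous piece of the buffer, e.g. the part handled by one stream.
struct Chunk {
    std::size_t offset;  // first element of the chunk
    std::size_t count;   // number of elements in the chunk
};

// Split `total` elements into `parts` contiguous chunks whose sizes differ
// by at most one element. Requires parts > 0.
std::vector<Chunk> split_chunks(std::size_t total, std::size_t parts) {
    assert(parts > 0);
    std::vector<Chunk> chunks;
    const std::size_t base = total / parts;
    const std::size_t rem = total % parts;
    std::size_t offset = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t count = base + (p < rem ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

With three streams, each `Chunk` returned by `split_chunks(buffer.data.size(), 3)` would describe the range one asynchronous copy transfers. As far as I understand, the host buffer would need to be page-locked (pinned) for those copies to actually overlap with kernel execution.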
The profiler time breakdown for each part of the process is included below. Does this optimization approach look sound? Are there better alternatives?
Thank you!
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.89%  1.06467s    239976  4.4360us     671ns  15.487us  [CUDA memcpy HtoD]
                   16.09%  252.32ms     79992  3.1540us  3.0710us  3.9360us  [CUDA memcpy DtoH]
                   16.02%  251.30ms     19998  12.566us  11.071us  32.064us  flux_dxee_
      API calls:   93.06%  4.19244s    319968  13.102us  3.0600us  1.8282ms  cudaMemcpy
                    2.75%  123.72ms      9999  12.373us  2.1280us  31.608us  cudaDeviceSynchronize
                    2.52%  113.38ms     19998  5.6690us  4.7960us  534.75us  cudaLaunchKernel
                    1.65%  74.122ms         1  74.122ms  74.122ms  74.122ms  cuDevicePrimaryCtxRetain
                    0.02%  1.0937ms         1  1.0937ms  1.0937ms  1.0937ms  cuMemAllocHost
                    0.00%  174.40us        17  10.258us  1.9800us  119.12us  cuMemAlloc
                    0.00%  128.88us       120  1.0740us      92ns  49.904us  cuDeviceGetAttribute
                    0.00%  108.79us       412     264ns     105ns  7.5850us  cuGetProcAddress
                    0.00%  14.363us         1  14.363us  14.363us  14.363us  cuDeviceGetName
                    0.00%  7.7920us         2  3.8960us     205ns  7.5870us  cuCtxGetCurrent
                    0.00%  5.2400us         1  5.2400us  5.2400us  5.2400us  cuDeviceGetPCIBusId
                    0.00%  2.8640us         1  2.8640us  2.8640us  2.8640us  cudaGetDevice
                    0.00%  2.1140us         4     528ns     143ns  1.1560us  cuCtxSetCurrent
                    0.00%  1.7970us         1  1.7970us  1.7970us  1.7970us  cuInit
                    0.00%  1.5350us         4     383ns     124ns  1.0310us  cuDeviceGetCount
                    0.00%     699ns         4     174ns     107ns     239ns  cuDriverGetVersion
                    0.00%     685ns         1     685ns     685ns     685ns  cuDeviceComputeCapability
                    0.00%     675ns         3     225ns     135ns     322ns  cuDeviceGet
                    0.00%     460ns         1     460ns     460ns     460ns  cuDeviceTotalMem
                    0.00%     172ns         1     172ns     172ns     172ns  cuDeviceGetUuid
                    0.00%     153ns         1     153ns     153ns     153ns  cuModuleGetLoadingMode
 OpenACC (excl):  100.00%  21.609us         1  21.609us  21.609us  21.609us  acc_device_init
                    0.00%       0ns    159992       0ns       0ns       0ns  acc_delete
                    0.00%       0ns        16       0ns       0ns       0ns  acc_alloc
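
For reference, the per-step picture implied by the profile can be worked out with a few lines of C++. This is only a sketch of the arithmetic: the helper names `calls_per_step` and `copy_share` are made up for illustration, and it assumes that the 9999 `cudaDeviceSynchronize` calls correspond to one call per time step.

```cpp
#include <cassert>

// Call counts per time step, derived from the totals in the profile.
struct PerStep {
    long htod;     // host-to-device copies per step
    long dtoh;     // device-to-host copies per step
    long kernels;  // kernel launches per step
};

// Assumption: `steps` is the number of time steps, taken here to be the
// number of cudaDeviceSynchronize calls. Requires steps > 0.
PerStep calls_per_step(long htod_calls, long dtoh_calls,
                       long kernel_calls, long steps) {
    assert(steps > 0);
    return {htod_calls / steps, dtoh_calls / steps, kernel_calls / steps};
}

// Fraction of GPU activity time spent in copies; all times in milliseconds.
double copy_share(double htod_ms, double dtoh_ms, double kernel_ms) {
    return (htod_ms + dtoh_ms) / (htod_ms + dtoh_ms + kernel_ms);
}
```

Calling `calls_per_step(239976, 79992, 19998, 9999)` gives 24 host-to-device copies, 8 device-to-host copies, and 2 kernel launches per step, and `copy_share(1064.67, 252.32, 251.30)` comes out at roughly 0.84. At about 4.4 us on average per host-to-device copy, each transfer looks small enough that per-call overhead may matter more than bandwidth, which is the main reason merging the arrays into fewer, larger copies seems attractive to me.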