Why are there still so many CPU to GPU data transfer operations between the two kernel functions, occupying 80% of the entire cudaMemcpy API?

1668115800 · January 13, 2025, 1:51pm

Hello CUDA community! In my CUDA Fortran program, I placed all the memory allocation and data transfer operations in the main program and encapsulated only the kernel functions in a module. The issue I encountered is that multiple CPU-to-GPU data transfers show up between two kernel calls. I suspect this is where the problem lies, but I’m not sure of the exact cause. So my question is: should I put memory allocation, data transfer, and the kernel functions all inside the module?

MatColgrove · January 13, 2025, 6:23pm

Again, it’s difficult to say without a reproducing example. Best guess is that it’s the Fortran descriptor being copied over as we’ve discussed in your other posts.
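
To illustrate the descriptor guess above, here is a minimal sketch that is not taken from the original poster's code. It assumes the extra transfers come from assumed-shape kernel arguments such as `a(:)`, which need an array descriptor (bounds and strides) on the device, whereas an explicit-shape argument with its extent passed by value does not. The names `kernels`, `scale_assumed`, and `scale_explicit` are hypothetical; whether the host-to-device copies in a profile actually differ between the two launches would need to be checked with a profiler on the real program.

```fortran
module kernels
  use cudafor
  implicit none
contains
  ! Assumed-shape dummy: the kernel needs an array descriptor for a(:).
  ! Assumption: this descriptor is what gets copied host-to-device.
  attributes(global) subroutine scale_assumed(a, s)
    real, device :: a(:)
    real, value  :: s
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= size(a)) a(i) = s * a(i)
  end subroutine scale_assumed

  ! Explicit-shape dummy: the extent n is passed by value,
  ! so no descriptor is needed for a(n).
  attributes(global) subroutine scale_explicit(a, n, s)
    integer, value :: n
    real, device   :: a(n)
    real, value    :: s
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = s * a(i)
  end subroutine scale_explicit
end module kernels

program main
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 1024
  real :: h(n)
  real, device, allocatable :: d(:)
  integer :: istat

  h = 1.0
  allocate(d(n))
  d = h                                   ! one intentional host-to-device copy

  ! Two back-to-back launches; compare the memcpy activity around each
  call scale_assumed<<<(n + 255) / 256, 256>>>(d, 2.0)
  call scale_explicit<<<(n + 255) / 256, 256>>>(d, n, 2.0)
  istat = cudaDeviceSynchronize()

  h = d                                   ! one intentional device-to-host copy
  print *, 'max value after both kernels: ', maxval(h)
end program main
```

Built with `nvfortran -cuda` and run under a profiler, this gives a small test case for comparing the transfers reported around each launch. It also suggests that moving the allocation and copy statements into the module would not by itself change how the kernel arguments are passed.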
