Hello CUDA Forum,
I am currently modifying a program and have run into an issue. When I profiled it with nvprof, the time spent on data transfers between the CPU and GPU reported in the GPU Activities section was only 1 second, but in the API Calls section cudaMemcpy took 4.75 seconds. Excluding the 1 second of actual GPU transfer time, where do the remaining 3.75 seconds come from?
I then checked the API activity measured by both Nsight Compute and Nsight Systems. Although I issued only one cudaMemcpy to transfer a packed one-dimensional array, the tools show this transfer split into six separate transfers, and I don't understand why. Additionally, do the extra 3.75 seconds in API Calls mean the CPU was idle and not performing any computation? Those seconds are still included in the total program runtime. Were some GPU activities not recorded?
Below are the data and graphs from the tools’ tests. Please help me review them and let me know if you notice any other issues. Thank you!
First is the profiling data from nvprof; second is the API behavior. The screenshots I captured include all data-transfer activity before the kernel flux_copydat_x is launched.
Most likely it's the virtual-to-pinned memory transfer on the host. DMA transfers to the device need to be done from pinned (page-locked) host memory, so there's an extra copy from pageable (virtual) memory into a pinned staging buffer first.
You can try adding the "pinned" attribute to the declaration of the allocatable arrays so they are allocated in page-locked (physical) rather than pageable (virtual) memory. While this eliminates the need for the extra staging copy, there is additional overhead during allocation, and the OS is not guaranteed to honor the request to pin the memory.
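A minimal sketch of what that looks like, using hypothetical array names (a_host, a_dev); the PINNED= specifier on ALLOCATE lets you check whether the OS actually page-locked the allocation:

```fortran
program pinned_example
  use cudafor
  implicit none
  real, pinned, allocatable :: a_host(:)   ! page-locked host array
  real, device, allocatable :: a_dev(:)    ! device array
  logical :: is_pinned
  integer :: istat, n

  n = 1024*1024
  ! The OS may decline to pin the memory; query the result with PINNED=.
  allocate(a_host(n), pinned=is_pinned)
  allocate(a_dev(n))
  if (.not. is_pinned) print *, 'Warning: host array was not page-locked'

  a_host = 1.0
  a_dev  = a_host          ! DMA transfer directly from pinned memory
  istat  = cudaDeviceSynchronize()

  deallocate(a_host, a_dev)
end program pinned_example
```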
I declared the arrays as pinned memory, and after reviewing the API behavior again I found that there is still a lot of data transfer between two kernel launches. My kernels are all encapsulated in one module, but I did not write the allocation code for the GPU arrays in that module; instead, both the host arrays and the device arrays are allocated in the main program, which uses this module. Could the long CPU-to-GPU transfer times between the two kernels be because the module cannot see the data from the main program, causing the system to automatically transfer data from the CPU to the GPU?
No, use-associated variables should be visible, and CUDA Fortran doesn't implicitly copy data, with the exception of the Fortran descriptors when you pass in assumed-shape arrays. These are relatively small, though, and the compiler can often optimize them away.
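For reference, a minimal sketch (hypothetical names: flux_mod, u_dev, scale_kernel) of the layout you describe: the device array lives at module scope, the main program allocates it via use association, and the kernel accesses it directly, so no implicit host-to-device copy is needed at launch:

```fortran
module flux_mod
  use cudafor
  implicit none
  real, device, allocatable :: u_dev(:)     ! module-scope device array
contains
  attributes(global) subroutine scale_kernel(n, factor)
    integer, value :: n
    real,    value :: factor
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) u_dev(i) = u_dev(i) * factor   ! module data, no descriptor passed
  end subroutine scale_kernel
end module flux_mod

program main
  use cudafor
  use flux_mod
  implicit none
  integer, parameter :: n = 1024
  real, allocatable  :: u_host(:)
  integer :: istat

  allocate(u_host(n), u_dev(n))   ! device array allocated from the main program
  u_host = 1.0
  u_dev  = u_host                 ! single explicit host-to-device copy
  call scale_kernel<<<(n+255)/256, 256>>>(n, 2.0)
  u_host = u_dev                  ! copy result back
  istat  = cudaDeviceSynchronize()
  print *, u_host(1)
end program main
```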
Since these descriptors can be optimized away, how should I compile the code? I ran a program in which memory allocation, data transfer, and the kernel were all encapsulated within one module, and there was no fragmented implicit transfer; that is why I made this assumption. Could you advise how I should handle these descriptors to resolve the implicit transfers?
Now, I don't know if the memory transfers you're seeing are indeed descriptors (you do have explicit cudaMemcpy calls), but if they are, the best approach is to pass these arrays as assumed-size arguments (i.e. the dummy argument declared with "*"). This passes just a pointer to the device array. Since the descriptor is what provides the shape and bounds information, you won't be able to use intrinsics that expect it (like LBOUND), nor use array syntax.
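A minimal sketch (hypothetical kernels add_shape/add_size) contrasting the two interfaces; with the assumed-size dummy only a raw device pointer is passed, so no descriptor has to be set up for the launch:

```fortran
module kernels_mod
  use cudafor
  implicit none
contains
  ! Assumed-shape version: a(:) carries a descriptor (shape/bounds) to the device.
  attributes(global) subroutine add_shape(a, n)
    real :: a(:)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine add_shape

  ! Assumed-size version: a(*) is just a pointer; no descriptor transfer,
  ! but LBOUND/SIZE and whole-array syntax are unavailable inside the kernel.
  attributes(global) subroutine add_size(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine add_size
end module kernels_mod
```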