Memory copy is not continuous

When I used OpenACC to optimize my code, I ran into two problems.
First, as the picture shows, I used deepcopy to copy the data, but the HtoD transfers still take a long time. The transfers are also not continuous: there are gaps between them, which wastes a lot of time. I copy many arrays at the same time. In the program I wrote:
!$acc enter data copyin(six,siy,siz,vol,xti,yti,zti…)
So is there any way to make the copies run back to back? When I zoom in on the details, I also see that the data copy-out is not continuous each time an acc loop executes. Is there a good way to fix this?
The second problem: when I followed the tutorial on the PGI official website using pgprof, I could not produce the various analysis diagrams shown in the tutorial after choosing the analysis option. My pgprof only shows some general descriptions.
How can I find out which part of the work is the most time-consuming and which array is being copied?
Where is the problem? The GPUs I am using are an NVIDIA GTX 750 Ti and Tesla K40 (3).

Another question: how do I add a picture?

Hi wanghr323,

So is there any way to make the copy run continuously?

That would depend on your data structure. The runtime can only copy data as contiguous blocks, so if your data is not contiguous, then it would be copied in multiple segments.

Note that by default, the runtime uses a double buffering system to copy data to/from the device. Since data must be in physical (pinned) memory, the runtime first copies the data from virtual memory to a pinned buffer, begins the transfer, and then starts filling the second buffer. This allows for more asynchronous data transfers.

To avoid the buffers, you can try the flag “-ta=tesla:pinned” to have the data allocated in pinned memory. However, pinned memory has a high allocation and deallocation cost, so it is typically only beneficial when there are few data allocations (via the data directive) and many transfers (via the update directive).
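
As a rough sketch of how the pinned option might be passed on the compile line: the file names solver.f90 and solver_pinned are hypothetical placeholders, and -acc and -Minfo=accel are the usual PGI flags for enabling OpenACC and printing accelerator feedback, added here as assumptions about your build. The guard only prints the command if pgfortran or the source file is not present.

```shell
# Hypothetical file names; replace with your own source and executable.
SRC=solver.f90
EXE=solver_pinned

# -acc enables OpenACC, -ta=tesla:pinned requests pinned host memory,
# -Minfo=accel prints the compiler's accelerator feedback.
PGI_FLAGS="-acc -ta=tesla:pinned -Minfo=accel"

if command -v pgfortran >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Word splitting of $PGI_FLAGS is intentional: each flag is a separate argument.
    pgfortran $PGI_FLAGS "$SRC" -o "$EXE"
else
    echo "would run: pgfortran $PGI_FLAGS $SRC -o $EXE"
fi
```

It is worth timing the run both with and without the pinned option, since the allocation cost mentioned above can outweigh the transfer savings.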

How can I know which part of the work is most time consuming, which array is being copied?

The profiler won't have information about which arrays are being copied. For that, I would suggest setting the PGI environment variable PGI_ACC_NOTIFY=2, which makes the runtime print a message whenever it copies a variable.

The argument to PGI_ACC_NOTIFY is a bit mask. So setting it to “1” will show all the kernel launches, “2” the data movement, and “3” will show both.
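
A small sketch of combining the two bits described above in a shell session. The variable names NOTIFY_KERNELS and NOTIFY_DATA and the executable name ./solver are hypothetical, used only for illustration; only the values 1 and 2 come from the description above.

```shell
# Bit values as described above.
NOTIFY_KERNELS=1   # report kernel launches
NOTIFY_DATA=2      # report data movement

# OR the bits together to request both kinds of message (1 | 2 = 3).
PGI_ACC_NOTIFY=$((NOTIFY_KERNELS | NOTIFY_DATA))
export PGI_ACC_NOTIFY
echo "PGI_ACC_NOTIFY=$PGI_ACC_NOTIFY"

# Hypothetical executable name; run your own OpenACC program here.
if [ -x ./solver ]; then
    ./solver
fi
```

To see only the data movement, set PGI_ACC_NOTIFY to the value of NOTIFY_DATA alone.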

More info on PGI_ACC_NOTIFY can be found here: OpenACC Getting Started :: PGI version 18.7 Documentation for x86 and NVIDIA Processors

Hope this helps,
Mat