one resides on the host and one resides on the device. If I do a copy assignment like
arr_host = arr_dev
there is no performance penalty (only a slight difference depending on whether arr_host is in regular or pinned memory). In my code each copy is 22 MB, and the run takes 10 min (vs. 13 min with pinned memory). However, if I specify the index range
arr_host = arr_dev(1:size(arr_host))
or
arr_host = arr_dev(padding+1:)
! given that arr_dev was allocated larger than arr_host
there is a dramatic performance difference (about 3 times slower). So I think PGI should revise the copy assignment.
Avoid using array sections and only copy the entire array.
Using array sections forces the compiler to generate multiple copies since there isn’t a general way at compile time to know the best method to copy the data. It’s a very difficult problem since array sections can be defined by any number of expressions that can only be evaluated at runtime. What we’re working on now is a way to determine at runtime the optimal way to copy the data. Finding a general solution will take some time.
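A minimal CUDA Fortran sketch of that workaround, assuming the device array is padded at the front by `padding` elements as in the question: copy the whole device array into a host staging buffer of the same size, then take the section in host memory. The names `n`, `padding`, and `staging` are hypothetical; the trade-off is an extra host buffer plus a host-side copy, so it is worth timing against the sectioned transfer on your own setup.

```fortran
! Sketch only: n, padding and staging are hypothetical names for illustration.
! Compile as CUDA Fortran (e.g. a .cuf source file).
program whole_array_copy
  use cudafor
  implicit none
  integer, parameter :: n = 1024, padding = 16
  real, allocatable :: arr_host(:), staging(:)
  real, device, allocatable :: arr_dev(:)

  ! Device array is larger than the host array, as in the question
  allocate(arr_dev(n + padding))
  allocate(staging(n + padding), arr_host(n))
  arr_dev = 1.0                   ! fill the device array (scalar assignment)

  ! Whole-array assignment, device to host: no section on the device side
  staging = arr_dev

  ! Extract the wanted elements in host memory instead of from arr_dev
  arr_host = staging(padding+1:)

  print *, arr_host(1), arr_host(n)
  deallocate(arr_dev, staging, arr_host)
end program whole_array_copy
```

If the staging buffer is reused across many transfers, declaring it with the `pinned` attribute (`real, allocatable, pinned :: staging(:)`) keeps the whole-array transfer in page-locked memory.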