assignment (device->host) performance issue

Suppose I have two arrays

arr_host

arr_dev

One resides on the host and the other on the device. If I do a copy assignment like

arr_host = arr_dev

there is no performance penalty (with a slight difference depending on whether arr_host is in regular or pinned memory). In my code, each copy is 22MB, so it takes 10min (vs 13min with pinned memory). However, if I specify the index range

arr_host = arr_dev(1:size(arr_host))

or

arr_host = arr_dev(padding+1:)
! given that arr_dev was allocated larger than arr_host

there is a dramatic performance difference (about 3 times slower). So I think PGI should revisit how copy assignment handles array sections.

Thanks,
Tuan

Hi Tuan,

We’re aware of this issue and are making progress.

  • Mat

Is there a workaround yet, Mat?

Tuan

Hi Tuan,

Any work-around solution by now Mat?

Avoid using array sections and only copy the entire array.

Using array sections forces the compiler to generate multiple copies since there isn’t a general way at compile time to know the best method to copy the data. It’s a very difficult problem since array sections can be defined by any number of expressions that can only be evaluated at runtime. What we’re working on now is a way to determine at runtime the optimal way to copy the data. Finding a general solution will take some time.

  • Mat
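
One possible reading of the advice above, as a minimal CUDA Fortran sketch rather than an official PGI recommendation: copy the whole device array into a host staging buffer of the same shape, then take the section on the host. Only arr_host and arr_dev come from the thread; the names staging, n and padding, and the sizes used, are assumptions made up for illustration.

```fortran
! Sketch only: n, padding and staging are hypothetical; arr_host and
! arr_dev are the arrays from the original question.
program whole_array_copy
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024   ! elements needed on the host (assumed)
  integer, parameter :: padding = 32    ! extra leading device elements (assumed)
  real, allocatable         :: arr_host(:), staging(:)
  real, allocatable, device :: arr_dev(:)

  allocate(arr_host(n))
  allocate(staging(n + padding))        ! host buffer with the same shape as arr_dev
  allocate(arr_dev(n + padding))

  arr_dev = 1.0                         ! fill the device array with a scalar

  ! Whole-array assignment: same shape on both sides, and no array
  ! section on the device side.
  staging = arr_dev

  ! The section is taken on the host, where it is an ordinary Fortran
  ! array-section copy.
  arr_host = staging(padding+1:)

  ! Sanity check that the data arrived.
  print *, 'all ones: ', all(arr_host == 1.0)

  deallocate(arr_host, staging, arr_dev)
end program whole_array_copy
```

The trade-off is that the padding elements are transferred as well and an extra host-side copy is needed, so whether this beats the section assignment depends on how large the padding is relative to the data. It is worth timing both variants in your own code.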

How about using the CUDA API, i.e. cudaMemcpy() or related functions?
Does it have the same problem?

Thanks,
Tuan