assignment (device->host) performance issue

Tuan · February 15, 2011, 8:05pm

Suppose I have two arrays

arr_host

arr_dev

one resides on the host and one resides on the device. If I do a copy assignment like

arr_host = arr_dev

there is no performance penalty (with only a slight difference between arr_host in regular memory and in pinned memory). In my code, each copy is 22 MB, so it takes 10 min (vs. 13 min with pinned memory). However, if I specify the index range

arr_host = arr_dev(1:size(arr_host))

or

arr_host = arr_dev(padding+1:)
! given that arr_dev was allocated larger than arr_host

there is a dramatic performance difference (about 3 times slower). So I think PGI should revise the copy assignment.

Thanks,
Tuan

MatColgrove · February 16, 2011, 12:36am

Hi Tuan,

We’re aware of this issue and are making progress.

Mat

Tuan · February 16, 2011, 3:27pm

Is there any workaround for now, Mat?

Tuan

MatColgrove · February 16, 2011, 6:48pm

Hi Tuan,

Is there any workaround for now, Mat?

Avoid using array sections and only copy the entire array.

Using array sections forces the compiler to generate multiple copies since there isn’t a general way at compile time to know the best method to copy the data. It’s a very difficult problem since array sections can be defined by any number of expressions that can only be evaluated at runtime. What we’re working on now is a way to determine at runtime the optimal way to copy the data. Finding a general solution will take some time.

Mat
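
A minimal sketch of the workaround described above, assuming a padded device array like the one in the original question. The names staging, n, and padding and the sizes are hypothetical, chosen only for illustration. The idea is to move the whole device array to the host in one assignment and take the section on the host side, at the cost of an extra host buffer and transferring the padding as well:

```fortran
program whole_array_copy
  use cudafor
  implicit none
  integer, parameter :: n = 1024       ! hypothetical size of arr_host
  integer, parameter :: padding = 32   ! hypothetical padding in arr_dev
  real, allocatable         :: arr_host(:), staging(:)
  real, device, allocatable :: arr_dev(:)

  allocate(arr_host(n), staging(padding + n), arr_dev(padding + n))

  ! Fill the device array through a whole-array host->device assignment.
  staging = 1.0
  arr_dev = staging

  ! Whole-array device->host copy: no array section on the device side.
  staging = arr_dev

  ! The section is taken on the host, so it is an ordinary Fortran copy.
  arr_host = staging(padding+1:padding+n)

  print *, 'first and last element:', arr_host(1), arr_host(n)
  deallocate(arr_host, staging, arr_dev)
end program whole_array_copy
```

Whether this is faster than the array-section assignment depends on how much padding is transferred, so it is worth timing both variants.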

Tuan · February 16, 2011, 9:53pm

How about using the CUDA API: cudaMemcpy() or related functions?
Does it have the same problem?

Thanks,
Tuan
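
For reference, a sketch of what the cudaMemcpy call in question could look like in CUDA Fortran. The thread does not say whether it avoids the slowdown, so this only illustrates the call, not a confirmed fix. The sizes n and padding and the helper array init are hypothetical. The starting elements are passed instead of array sections, and in CUDA Fortran the count is given in elements, not bytes:

```fortran
program memcpy_with_offset
  use cudafor
  implicit none
  integer, parameter :: n = 1024       ! hypothetical size of arr_host
  integer, parameter :: padding = 32   ! hypothetical padding in arr_dev
  real, allocatable         :: arr_host(:), init(:)
  real, device, allocatable :: arr_dev(:)
  integer :: istat

  allocate(arr_host(n), init(padding + n), arr_dev(padding + n))

  ! Fill the device array through a whole-array host->device assignment.
  init = 1.0
  arr_dev = init

  ! Copy n contiguous elements starting at arr_dev(padding+1) into arr_host.
  ! The count (third argument) is in elements in CUDA Fortran.
  istat = cudaMemcpy(arr_host(1), arr_dev(padding+1), n, cudaMemcpyDeviceToHost)
  if (istat /= cudaSuccess) then
    print *, 'cudaMemcpy failed: ', cudaGetErrorString(istat)
  end if

  print *, 'first and last element:', arr_host(1), arr_host(n)
  deallocate(arr_host, init, arr_dev)
end program memcpy_with_offset
```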
