Subarray usage in a copyin clause

luxuia · April 18, 2013, 2:45am

I am trying to use something like

 copyin(arr[0:total/2], arr[total/2:total/2])

instead of

copyin(arr[0:total])

because I have a lot of data and need to split it up.
But the result is correct with the second form and wrong with the first.
I don't know what is happening.
Thanks in advance.

luxuia · April 18, 2013, 2:58am

I wrote a small test like this:

        int arr[100];
        for (int i = 0; i < 100; ++i) {
                arr[i] = i*2;
        }
        int ans[100];

        #pragma acc kernels copyin(arr[0:50], arr[50:50]), copyout(ans[100])
        for (int i = 0; i < 100; ++i) {
                ans[i] = arr[i];
        }
        for (int i = 0; i < 100; ++i) {
                fprintf(stderr, "%d\t", ans[i]);
        }

and found that arr[0:50] is not actually copied into device memory.

What should I do to copy in two subsets of an array?
Thanks.

luxuia · April 18, 2013, 3:37am

copyin(temp2) followed by copy[in](temp2); copyin attribute ignored

That is what the compiler message says.

What can I do? :(

Deeply sad.

luxuia · April 18, 2013, 3:53am

I don't think I have found a real solution, only a very ugly workaround…

int *_arr = arr;

so I can use _arr instead of arr wherever the warning occurs…
but it is very slow…
I am afraid to look at the output of the time command…
Is there any other way to do this?

MatColgrove · April 18, 2013, 4:22pm

Hi luxuia,

No, you can’t put the same variable more than once in a data copy clause. There has to be a single base host address from which to map the host and device data.

Though, I question why you need to break apart the array this way. Can you please clarify what you mean by “as i have big data” and why you believe it needs to be split?

If you don’t have enough memory on the device to hold the entire array, breaking it apart in the data clause won’t help. In this case, you need to strip-mine the loop into blocks of a size that fits on the device. Something like:

% cat test_block.c 
#include <stdio.h>

int main () {
        const int size = 100;
        int arr[size];
        int ans[size];
        int block_size = 50;
        for (int i = 0; i <  size; ++i) {
                arr[i] = i*2;
        }

        for (int b=0; b < size; b+=block_size) {   /* one block per iteration */
#pragma acc kernels copyin(arr[b:block_size]), copyout(ans[b:block_size])
                for (int i = b; i < b+block_size; ++i) {
                        ans[i] = arr[i];
                }
        }
        for (int i = 0; i < size; ++i) {
                printf("%d: %d\n", i, ans[i]);
        }
        return 0;
}

sb-colgrove:/local/home/colgrove% pgcc test_block.c -acc -Minfo=accel
main:
     12, Generating copyout(ans[b:block_size])
         Generating copyin(arr[b:block_size])
     13, Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     14, Loop is parallelizable
         Accelerator kernel generated
         14, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
% setenv PGI_ACC_NOTIFY 1
% a.out
launch CUDA kernel  file=../test_block.c function=main line=14 device=0 grid=1 block=64
launch CUDA kernel  file=../test_block.c function=main line=14 device=0 grid=1 block=64
0: 0
1: 2
2: 4
3: 6
4: 8
5: 10
6: 12
7: 14
8: 16
....
94: 188
95: 190
96: 192
97: 194
98: 196
99: 198

Hope this helps,
Mat