A "simple" question

Hello all. I have a question about CUDA programming.

Suppose I have an array with 10k elements. I want to evaluate the following expressions:
a[3] = a[5] + a[4]
a[4] = a[6] + a[5]
a[5] = a[7] + a[6]
a[6] = a[8] + a[7]
.
.
.
a[9998] = a[10000] + a[9999]

How do I arrange the threads and blocks to compute the above expressions in parallel?

Thanks a lot; this concept is very important for my work… :wacko:

1. Partition the output among threadblocks.
2. Read all the data needed by the threadblock into shared memory.
3. Each thread computes its sum (by fetching from shared memory) and saves it to global memory.

Make sure that the reads in step 2 are coalesced. This shouldn't be difficult: your access pattern is regular, so every threadblock needs a contiguous region of memory. You may need to pad to ensure proper alignment.

Make sure that step 3 saves directly to global memory (gmem). That way threads read the unmodified input from shared memory and store their results to gmem, so threads within a threadblock never read updated data (which would be incorrect behavior in your case) due to parallelism.
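
The three steps above can be sketched roughly as follows. This is only a minimal sketch under some assumptions, not a tuned implementation: the names shiftedSum, shiftedSumHost, tile, first and last are made up for illustration, BLOCK_SIZE is assumed to be 256, and error checking and padding for alignment are left out. One further assumption goes beyond the steps above: the kernel writes to a separate output array. Threadblocks run in no guaranteed order, so with in-place stores a neighbouring block could overwrite the one or two boundary elements a block still has to load.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed threads per block; the launch must use this value

// Computes out[i] = in[i + 1] + in[i + 2] for first <= i <= last.
// Assumes in[last + 2] exists and that out is a separate array (see the note above).
__global__ void shiftedSum(const double *in, double *out, int first, int last)
{
    // One slot per thread plus one extra (halo) element for the block's last thread.
    __shared__ double tile[BLOCK_SIZE + 1];

    int t = threadIdx.x;
    int i = first + blockIdx.x * blockDim.x + t;  // step 1: output index owned by this thread
    bool active = (i <= last);

    if (active) {
        // Step 2: consecutive threads read consecutive addresses.
        tile[t] = in[i + 1];
        // The last active thread of the block also loads the halo element.
        if (t == blockDim.x - 1 || i == last)
            tile[t + 1] = in[i + 2];
    }
    __syncthreads();  // reached by every thread, active or not

    // Step 3: read only from shared memory, store straight to global memory.
    if (active)
        out[i] = tile[t] + tile[t + 1];
}

// Hypothetical host helper: returns a copy of a in which
// a[i] = a[i + 1] + a[i + 2] for first <= i <= last; other elements are unchanged.
std::vector<double> shiftedSumHost(const std::vector<double> &a, int first, int last)
{
    std::vector<double> result(a);
    int count = last - first + 1;
    if (count <= 0) return result;
    assert(first >= 0 && last + 2 < static_cast<int>(a.size()));

    size_t bytes = a.size() * sizeof(double);
    double *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, a.data(), bytes, cudaMemcpyHostToDevice);
    // Start the output as a copy so elements outside [first, last] keep their values.
    cudaMemcpy(dOut, dIn, bytes, cudaMemcpyDeviceToDevice);

    int blocks = (count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    shiftedSum<<<blocks, BLOCK_SIZE>>>(dIn, dOut, first, last);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

For the range in the question this would be called as shiftedSumHost(a, 3, 9998). Note that the last expression reads a[10000], so the array needs at least 10001 elements for that call.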

Paulius

Thank you very much for your help. I am studying your answer. In the meantime, I tried to write the kernel, which is:

__global__ void compute(double *Six)
{
    int natom = 10000;
    int aBegin = blockIdx.x * BLOCK_SIZE * natom;
    int aEnd = aBegin + natom - 1;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int astep = BLOCK_SIZE;

    for (int i = aBegin; i <= aEnd; i += astep)
    {
        __shared__ double As[BLOCK_SIZE];

        As[threadIdx.x] = Six[i + threadIdx.x];

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            As[threadIdx.x] = As[threadIdx.x + 2] + As[threadIdx.x + 1];

            __syncthreads();
        }
    }
}

Does this work? Please give me any comments :o