one dimensional circular spin program

Hello all, I am new to GPU programming and am trying to write a GPU program to compute the average energy and the motion of a system of spins (say 10000 atoms).

The concept is that I select a group of spins (group 1) that are independent of each other into a “block”. This group of spins has its own energy and motion values. However, to compute the other group of spins (group 2), their values depend on the group 1 values.

I don’t have any idea how, after the GPU finishes computing one block, to put those values into global memory… and finally get those values back to compute the other blocks…

Could anyone please give me some ideas? Thanks a lot~~

You can’t. The order of block execution is undefined. You can only be sure that all blocks are done and global memory writes have completed by letting the kernel finish. Your best bet is to do all the completely independent parts of block 1 through block N and write out partial results to global memory. Then run a smaller kernel that can finish up the dependencies.
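For example, a minimal sketch of that two-pass structure (the kernel names, the partial[] array and the placeholder arithmetic are all made up just to illustrate the pattern):

// Pass 1: every block computes only its completely independent work
// and writes partial results to global memory.
__global__ void computeIndependentParts(float *partial, const float *spins, int N)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N)
        partial[idx] = 3.0f * spins[idx];   // placeholder for block-local work
}

// Pass 2: a smaller kernel launched after the first one has finished,
// so every global memory write from pass 1 is guaranteed to be visible.
__global__ void resolveDependencies(float *result, const float *partial, int N)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N)
        result[idx] = partial[idx] + (idx + 1 < N ? partial[idx + 1] : 0.0f);   // placeholder for work that crosses block boundaries
}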

I was confused by…

Did you mean group 2 spins depend on group 1 values?

Since you can’t serialize the CUDA blocks, can you create independent kernels to operate on each group in turn? You could serialize the execution of the kernels.
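On the host that serialization is automatic as long as the kernels go to the same stream; something like this (the kernel names and d_spins are invented purely for illustration):

int threads = 256;
int blocks  = (N + threads - 1) / threads;

// Group 1 spins depend only on data already sitting in global memory.
updateGroup1<<<blocks, threads>>>(d_spins, N);

// Group 2 spins read the values that updateGroup1 just wrote. Launches
// issued to the same stream execute in order, so this kernel will not
// start until the previous one has completed.
updateGroup2<<<blocks, threads>>>(d_spins, N);

cudaDeviceSynchronize();   // wait for both kernels before copying results back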

To wildcat4096

Sorry… I modified the topic again yesterday :rolleyes:

To MisterAnderson42

Thanks for your suggestion. Now I have one question… I know how to allocate the whole array in host and device memory, but I need to access specific elements of the array for computation… Is that possible?

For example, I want the kernel to run the instructions below in parallel:
for(int i = 0; i <10; ++i)
w[i] = 3*(x[i+1] + y[i+1]);

for(int i = 0; i <10; ++i)
w[i+1] = 3*(x[i+2] + y[i+2]);

and w[i] and w[i+1] are independent.

If I have already allocated the whole arrays w[], x[] and y[] in device memory, how do I structure the GPU code to compute these in parallel… I am very confused now… please help me :(

How are w[i] and w[i+1] independent? You have the two loops computing the same thing and writing it to w[i], just that the 2nd loop starts at iteration 1 instead of 0.

Do you just need to compute w[i] for all particles i? That’s easy: just have each thread calculate w[i] for a single particle.

int idx = blockDim.x * blockIdx.x + threadIdx.x;

if (idx < N)

    w[idx] = 3*(x[idx+1] + y[idx+1]);

Then launch ceil(N/block_size) blocks.

Of course, the memory reads are not coalesced, so it will be slow. But the example demonstrates how you can break up the computation of w[i] into threads.
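For completeness, here is roughly how that kernel and its launch fit together (the d_w, d_x, d_y names and the block size are just placeholders; x and y are assumed to hold at least N+1 elements so the idx+1 reads stay in bounds):

__global__ void computeW(float *w, const float *x, const float *y, int N)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N)
        w[idx] = 3.0f * (x[idx + 1] + y[idx + 1]);
}

// Host side: one thread per element, enough blocks to cover all N elements.
int block_size = 256;
int num_blocks = (N + block_size - 1) / block_size;   // same as ceil(N / block_size)
computeW<<<num_blocks, block_size>>>(d_w, d_x, d_y, N);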

I think I am misleading you… sorry, let me try to state my question again :P
For example:

I construct a one-dimensional array a[i] which has 10 elements.

0.4 0.7 1.0 0.3 0.67 0.55 0.44 0.12 0.79 0.41
a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9]

Due to some mathematical and physical conditions, the element a[3] (in fact this is the energy value of a spin) depends on the values a[1], a[2], a[4] and a[5].
a[6] depends on the values a[4], a[5], a[7] and a[8]. However, a[3] and a[6] are independent.

My idea is that I have already allocated this array in shared memory and want the GPU to compute a[3] and a[6] in parallel.

But how do I get the GPU to compute 2 equations in parallel, for example,

a[3] = a[1] + a[2]+ a[4] + a[5]

a[6] = a[4] + a[5]+ a[7] + a[8]

But I don’t know how to put only the elements a[3] and a[6], and not the whole array, into a block…

Thank you very much for your help.

I hesitate to even make the following suggestion out of fear of not having understood your problem…

if(threadIdx.x == 3) {

  a[3] = a[1] + a[2]+ a[4] + a[5];

} else if(threadIdx.x == 6) {

  a[6] = a[4] + a[5]+ a[7] + a[8];

}
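If that is along the right lines, one way to try it (purely a sketch; d_a is assumed to be the device copy of your 10-element array) is to wrap it in a kernel and launch one block with a thread per element:

__global__ void updateSpins(float *a)
{
    if (threadIdx.x == 3) {
        a[3] = a[1] + a[2] + a[4] + a[5];
    } else if (threadIdx.x == 6) {
        a[6] = a[4] + a[5] + a[7] + a[8];
    }
}

// One block of 10 threads; only threads 3 and 6 do any work, and they
// run in parallel because they write to independent elements.
updateSpins<<<1, 10>>>(d_a);

Note that you never pass “only a[3] and a[6]” to a block; the whole array lives in device memory and each thread simply indexes the elements it needs.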