Where to put __syncthreads()

   

for(n=2; n<1024; n=n*2){
    if(threadIdx.x<(1024/n)){
        if(n==2){
            if(shared[threadIdx.x*2]>shared[threadIdx.x*2+1]){
                printf(" n=%i tr=%i shared[%i]=%i|shared[%i]=%i  \n",n, threadIdx.x, threadIdx.x*2,shared[threadIdx.x*2],threadIdx.x*2+1,shared[threadIdx.x*2+1]);
                swap(shared[threadIdx.x+n/2-1],shared[threadIdx.x+n-1]);
            }
        }
        else{
            if(shared[threadIdx.x+n/2-1]>shared[threadIdx.x+n-1]){
                printf(" n=%i tr=%i shared[%i]=%i|shared[%i]=%i  \n",n, threadIdx.x, threadIdx.x+n/2-1,shared[threadIdx.x+n/2-1],threadIdx.x+n-1,shared[threadIdx.x+n-1]);
                swap(shared[threadIdx.x+n/2-1],shared[threadIdx.x+n-1]);
            }
        }
    }
    __syncthreads();

I run it in emulation mode and get:

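The indentation is a bit confusing, but as far as I can see you want to put it inside the for loop, since you are manipulating shared memory: each pass reads values that other threads wrote in the previous pass. Keep it outside the if, so that every thread in the block reaches it.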
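A minimal sketch of that placement, assuming the goal is the maximum of 1024 ints held in shared memory by one block of 512 threads. The kernel name pairwiseMax, the stride-based indexing and the host code are illustrative assumptions, not the original poster's code. The point is only that __syncthreads() sits inside the for loop but outside the if, so all threads of the block execute it on every pass.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024  // elements handled by one block (figure taken from the thread)

// Hypothetical kernel: leaves the maximum of in[0..N-1] in *out.
__global__ void pairwiseMax(const int *in, int *out)
{
    __shared__ int shared[N];
    int tid = threadIdx.x;

    // Each of the N/2 threads loads two elements.
    shared[tid] = in[tid];
    shared[tid + N / 2] = in[tid + N / 2];
    __syncthreads();  // all loads must be visible before the first compare

    for (int stride = N / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            // Only threads in the lower half of the active range do work.
            if (shared[tid + stride] > shared[tid])
                shared[tid] = shared[tid + stride];
        }
        // Inside the for loop, outside the if: every thread of the block
        // executes the barrier once per pass.
        __syncthreads();
    }

    if (tid == 0)
        *out = shared[0];
}

int main()
{
    int h_in[N];
    for (int i = 0; i < N; ++i)
        h_in[i] = (i * 37) % 1000;  // arbitrary test data

    int *d_in, *d_out, h_out = 0;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    pairwiseMax<<<1, N / 2>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("max = %d\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the barrier were moved inside the if, only some threads of the block would reach it, and the behaviour of __syncthreads() in that situation is undefined.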

I need to find the maximum element of an array.
I am trying to do it as in a “bitonic merge”:

2 4 5 8 1 2 6 5 4 8 2 2 3
-4---8---2---6---8---2---
---8-------6-------8-----
-----------8-------------

It's simple, but I don't fully understand it yet…

One block of 512 threads handles 1024 numbers,
so the whole array is 1024 * (number of blocks) elements,
and afterwards I do the same operation across the blocks as I did across the threads…

If you want to find the maximum, you just need to adjust the reduction sample from the CUDA SDK.
Change the +'s into fmaxf()'s (or max() for integer data) and you’re done.
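A rough sketch of what that change could look like, assuming float input and 512 threads per block with two elements loaded per thread. The kernel below is written in the style of a sum reduction and is not copied from the SDK sample; the names blockMax, THREADS, d_partial and d_result are made up for illustration. Each block writes one partial maximum, and a second launch with a single block reduces those partial results, which covers the "1024 * some blocks" case.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

#define THREADS 512  // threads per block; each thread loads two elements

// Hypothetical kernel: fmaxf() stands where a sum reduction would use +.
// Writes one partial maximum per block to out[blockIdx.x].
__global__ void blockMax(const float *in, float *out, unsigned int n)
{
    __shared__ float sdata[THREADS];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (2 * THREADS) + tid;

    // Out-of-range slots get -FLT_MAX, the identity element for max.
    float a = (i < n) ? in[i] : -FLT_MAX;
    float b = (i + THREADS < n) ? in[i + THREADS] : -FLT_MAX;
    sdata[tid] = fmaxf(a, b);
    __syncthreads();

    for (unsigned int s = THREADS / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();  // inside the loop, outside the if
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const unsigned int n = 1024 * 8;  // 8 blocks of 1024 numbers (arbitrary)
    const unsigned int blocks = (n + 2 * THREADS - 1) / (2 * THREADS);

    float *h_in = new float[n];
    for (unsigned int i = 0; i < n; ++i)
        h_in[i] = (float)((i * 7919u) % 10007u);  // arbitrary test data

    float *d_in, *d_partial, *d_result, h_result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: one partial maximum per block.
    blockMax<<<blocks, THREADS>>>(d_in, d_partial, n);
    // Pass 2: a single block reduces the partial maxima.
    // This two-pass form only works while blocks <= 2 * THREADS.
    blockMax<<<1, THREADS>>>(d_partial, d_result, blocks);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f\n", h_result);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    delete[] h_in;
    return 0;
}
```

For the integer data printed with %i earlier in the thread, the same structure should work with int buffers, max() in place of fmaxf() and INT_MIN in place of -FLT_MAX.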

That is what I did, just as you said :)