Synchronization speed improvement: why does __syncthreads improve performance?

Hello,

I have been playing a bit with __syncthreads and don’t really understand why using it in the following code improves the performance of the algorithm.

// Part of the __device__ function which computes a range of AES expanded keys.

    unsigned int s_expanded[(176/4)];

    s_expanded[0] = g_key[0 + (where*4)];
    s_expanded[1] = g_key[1 + (where*4)];
    s_expanded[2] = g_key[2 + (where*4)];
    s_expanded[3] = g_key[3 + (where*4)];
    __syncthreads();

    for ( i = 4; i < 176/4; i++ )
    {
        temp = s_expanded[i-1];
        if ( ( i % 4 ) == 0 )
        {
            temp = KeyScheduleCore ( temp, rconIter );
            rconIter++;
            __syncthreads();
        }
        temp = temp ^ s_expanded[i-4];
        __syncthreads();
        s_expanded[i] = temp;
        __syncthreads();
    }
    __syncthreads();

As can be seen, there is no need for __syncthreads at all, since there is no communication between threads.

The speedup after adding the __syncthreads calls is quite large, more than 20%.
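For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two variants could be timed with CUDA events. It is not the actual code: `mix` is a hypothetical stand-in for KeyScheduleCore (it is not the real AES key schedule core), and the names `expand_kernel` and `time_variant`, the dummy key data, and the single warm-up launch are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for KeyScheduleCore; NOT the real AES key schedule core.
__device__ unsigned int mix(unsigned int w, unsigned int rcon)
{
    return ((w << 8) | (w >> 24)) ^ rcon;   // rotate left by 8 bits, then XOR
}

// Same loop shape as the fragment in this post; USE_SYNC toggles the barriers.
template <bool USE_SYNC>
__global__ void expand_kernel(const unsigned int *g_key, unsigned int *g_out)
{
    unsigned int where = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s_expanded[176 / 4];       // per-thread array, not __shared__

    for (int i = 0; i < 4; i++)
        s_expanded[i] = g_key[i + where * 4];
    if (USE_SYNC) __syncthreads();

    unsigned int rconIter = 1;
    for (int i = 4; i < 176 / 4; i++) {
        unsigned int temp = s_expanded[i - 1];
        if ((i % 4) == 0) {                 // i is the same for every thread
            temp = mix(temp, rconIter);
            rconIter++;
        }
        s_expanded[i] = temp ^ s_expanded[i - 4];
        if (USE_SYNC) __syncthreads();      // reached by all threads of the block
    }
    g_out[where] = s_expanded[176 / 4 - 1]; // store a result so the loop is kept
}

// Returns the elapsed time in milliseconds of one launch of the chosen variant.
template <bool USE_SYNC>
float time_variant(int blocks, int threads)
{
    size_t n = (size_t)blocks * threads;
    unsigned int *d_key = NULL, *d_out = NULL;
    cudaMalloc(&d_key, n * 4 * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));
    cudaMemset(d_key, 0x5a, n * 4 * sizeof(unsigned int)); // dummy key material

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    expand_kernel<USE_SYNC><<<blocks, threads>>>(d_key, d_out); // warm-up launch
    cudaEventRecord(start);
    expand_kernel<USE_SYNC><<<blocks, threads>>>(d_key, d_out); // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_key);
    cudaFree(d_out);
    return ms;
}
```

Calling `time_variant<false>(512, 256)` and `time_variant<true>(512, 256)` from `main` and printing both values gives a like-for-like comparison for the launch configuration mentioned in this post. A single timed launch is noisy, so averaging over many launches would probably be more trustworthy.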

Thanks!

Could it be because this way I am forcing each multiprocessor to execute the same instruction?

And I forgot:

Register usage is 5.
Threads per block is 256.
Number of blocks is 512.

Sorry for replying to my own post.