Synchronization speed improvement: why does __syncthreads improve performance?


I have been playing a bit with __syncthreads and don’t really understand why using it in the following code improves the performance of the algorithm.

// Part of the __device__ function which computes a range of AES expanded keys.

      unsigned int s_expanded[176/4];

      for ( i = 4; i < 176/4; i++ )
      {
              temp = s_expanded[i-1];

              if ( ( i % 4 ) == 0 )
              {
                      temp = KeyScheduleCore ( temp, rconIter );
                      rconIter++;
              }

              temp = temp ^ s_expanded[i-4];
              s_expanded[i] = temp;
      }


As can be seen, there is no need for __syncthreads at all, since there is no communication between threads.

The speedup after adding __syncthreads is quite large, more than 20%.


Could it be because this way I am forcing each multiprocessor to execute the same instruction?

And I forgot:

Register usage is 5.
Threads per block: 256.
Blocks: 512.

Sorry for replying to my own post.
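
A minimal sketch of how the two variants could be timed against each other with CUDA events, for anyone who wants to reproduce the comparison. Everything beyond the loop in the snippet is an assumption: KeyScheduleCoreStub is a hypothetical stand-in (the real core also applies the S-box), the barrier is placed at the end of each loop iteration because the post does not say where it was put, and the kernel name, the dummy key data, and the checksum write are only there to make the example complete. It is not the original kernel, and the timings it prints will depend on the GPU and compiler version.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define EXPANDED_WORDS (176 / 4)   // 44 32-bit words, as in the snippet

// Hypothetical stand-in for KeyScheduleCore: rotate the word by 8 bits and
// XOR in the round counter. The real AES core also applies the S-box.
__device__ unsigned int KeyScheduleCoreStub(unsigned int w, unsigned int rconIter)
{
    return ((w << 8) | (w >> 24)) ^ rconIter;
}

// UseBarrier selects at compile time whether the loop contains __syncthreads().
template <bool UseBarrier>
__global__ void expandKeys(const unsigned int *keys, unsigned int *out)
{
    unsigned int s_expanded[EXPANDED_WORDS];   // per-thread array
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int rconIter = 1;

    for (int i = 0; i < 4; i++)                // first four words: the key
        s_expanded[i] = keys[4 * tid + i];

    for (int i = 4; i < EXPANDED_WORDS; i++)
    {
        unsigned int temp = s_expanded[i - 1];
        if ((i % 4) == 0)                      // same branch in every thread
        {
            temp = KeyScheduleCoreStub(temp, rconIter);
            rconIter++;
        }
        s_expanded[i] = temp ^ s_expanded[i - 4];

        if (UseBarrier)
            __syncthreads();                   // assumed placement
    }

    // Write a checksum so the compiler cannot discard the loop.
    unsigned int sum = 0;
    for (int i = 0; i < EXPANDED_WORDS; i++)
        sum ^= s_expanded[i];
    out[tid] = sum;
}

// One warm-up launch, then one timed launch. Returns the elapsed time in
// milliseconds and copies the per-thread checksums into `host`.
template <bool UseBarrier>
float runVariant(const unsigned int *d_keys, unsigned int *d_out,
                 std::vector<unsigned int> &host, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    expandKeys<UseBarrier><<<blocks, threads>>>(d_keys, d_out);   // warm-up
    cudaEventRecord(start);
    expandKeys<UseBarrier><<<blocks, threads>>>(d_keys, d_out);   // timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    cudaMemcpy(host.data(), d_out, host.size() * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    return ms;
}

// Prints both timings. Returns 0 if the two variants produce identical
// checksums, 1 if they differ or a device allocation fails.
int runComparison(int blocks, int threads)
{
    const size_t n = (size_t)blocks * threads;
    unsigned int *d_keys = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_keys, 4 * n * sizeof(unsigned int)) != cudaSuccess)
        return 1;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess)
    {
        cudaFree(d_keys);
        return 1;
    }
    cudaMemset(d_keys, 0x5a, 4 * n * sizeof(unsigned int));      // dummy keys

    std::vector<unsigned int> plain(n), synced(n);
    float msPlain  = runVariant<false>(d_keys, d_out, plain, blocks, threads);
    float msSynced = runVariant<true>(d_keys, d_out, synced, blocks, threads);
    printf("without __syncthreads: %.3f ms\n", msPlain);
    printf("with    __syncthreads: %.3f ms\n", msSynced);

    cudaFree(d_keys);
    cudaFree(d_out);
    return plain == synced ? 0 : 1;
}
```

Calling runComparison(512, 256) from main uses the launch configuration given above. Since the branch on i % 4 is taken by every thread at the same iteration, the loop has no divergence to begin with, so a measured difference between the two variants would have to come from somewhere else; comparing the code generated for each (for example with nvcc -ptx) is one way to look for it.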