__syncthreads and __threadfence together in a loop

I am confused about using the CUDA functions __threadfence() and __syncthreads() together.
The documentation says that __threadfence() guarantees that all global and shared memory accesses made by the calling thread before the call are visible to all threads in the system (I use global memory only).
And __syncthreads() waits until all the threads in the block have reached that point.

So my question is: after updating a variable, if I call __threadfence() followed by __syncthreads(), all the threads in the same block should see the correct values of the variables updated by all the other threads in the block, shouldn't they? What is the difference if I use __syncthreads() alone? And what is the difference if I change the order of these functions, i.e. from
__threadfence(); __syncthreads();
to
__syncthreads(); __threadfence();

Unfortunately, I tried all of these cases in my code, and the threads still do not get the updated values from all the other threads in the same block. In principle, I thought the first order should work fine.
I am running a loop within a single block. If I put the loop in host code (launching the kernel once per iteration), it works fine. But if the loop is inside the kernel, it simply produces an incorrect result. I understand the problem with inter-block communication,
but this loop does not communicate with other blocks.

Please help me.

After __syncthreads() returns, you can be sure that every thread in the block has completed the instructions before it, so you know every thread has executed __threadfence().
If you swap the order and call __syncthreads() first, it makes no sense: a thread can pass the barrier before the other threads have executed their __threadfence().

The compiler also optimizes based on where __syncthreads() is used.
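
Since the kernel itself was not posted, the following is only a guess at the cause. A frequent reason an in-kernel loop gives different results from a host-side loop is that each iteration needs two barriers: one after all threads have read the old values and one after all threads have written the new ones. Below is a minimal sketch under that assumption. The kernel name iterate_in_block, the neighbour-sum update, and the helpers run_on_gpu and run_on_cpu are all hypothetical, made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: must be launched with ONE block of exactly n threads.
// Each iteration replaces data[i] with data[i] + data[(i + 1) % n].
__global__ void iterate_in_block(int *data, int n, int iters)
{
    int tid = threadIdx.x;
    for (int it = 0; it < iters; ++it) {
        // Read phase: take the values produced by the previous iteration.
        int mine  = data[tid];
        int right = data[(tid + 1) % n];

        // Barrier 1: nobody may overwrite data[] until every thread has read.
        __syncthreads();

        // Write phase.
        data[tid] = mine + right;

        // Barrier 2: all writes are complete and visible to the whole block
        // before any thread starts the next iteration's reads.
        __syncthreads();
    }
}

// Host wrapper (error checking omitted for brevity; init.size() must be <= 1024).
std::vector<int> run_on_gpu(const std::vector<int> &init, int iters)
{
    int n = static_cast<int>(init.size());
    size_t bytes = n * sizeof(int);
    std::vector<int> out(n);
    int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, init.data(), bytes, cudaMemcpyHostToDevice);
    iterate_in_block<<<1, n>>>(d, n, iters);
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}

// CPU reference: same update, using a second buffer instead of barriers.
std::vector<int> run_on_cpu(std::vector<int> cur, int iters)
{
    int n = static_cast<int>(cur.size());
    std::vector<int> next(n);
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i)
            next[i] = cur[i] + cur[(i + 1) % n];
        cur.swap(next);
    }
    return cur;
}

int main()
{
    std::vector<int> init(256);
    for (int i = 0; i < 256; ++i) init[i] = i;
    bool same = run_on_gpu(init, 10) == run_on_cpu(init, 10);
    printf(same ? "GPU matches CPU reference\n" : "MISMATCH\n");
    return same ? 0 : 1;
}
```

Within a single block, __threadfence() is not needed here: according to the programming guide, __syncthreads() already makes the global and shared memory accesses issued before the barrier visible to all threads in the block. Note also that every thread must reach each __syncthreads(), so the loop bounds have to be the same for all threads in the block.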

Thank you. But what do you mean by "the compiler also optimizes based on __syncthreads() usage"?
