Kernel is faster when adding a __syncthreads() call: because of synchronized memory transfers?

As I say in the title, I'm seeing a modest performance improvement in my kernels when I call __syncthreads() just before the threads write their results back to the GPU's global memory (VRAM). Would this simply be because the writes from all threads in the block then happen at about the same time? Are there other reasons why syncing threads could help a kernel's speed?

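__syncthreads() can have a significant influence on caching behavior if synchronized execution is necessary to achieve good data reuse. Without the barrier, the warps of a block can drift apart and touch different regions of memory at the same time, so they compete for cache space. With the barrier, the warps stay in the same phase of the computation, and data fetched by one warp is more likely to still be in the cache when the other warps need it.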

You can check with the profiler how effective the cache is with and without __syncthreads(), for example by comparing the L1 and L2 hit rates of the two versions of the kernel.
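
To make the data-reuse argument concrete, here is a minimal sketch of how one might compare the two variants. It is not the original poster's kernel: the kernel `sumTable`, the helper `timeKernel`, the tile size, and the launch configuration are all made up for illustration. Every thread walks the same read-only table, and the templated `SYNC` flag decides whether a barrier after each tile keeps the warps of a block together.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): every thread reads the same table.
// With SYNC = true, a barrier after each tile keeps the warps of a block in
// the same tile, so cache lines fetched by one warp can be reused by the others.
template <bool SYNC>
__global__ void sumTable(const float* __restrict__ table, float* out, int n, int tile)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float scale = (float)(tid & 7);
    float acc = 0.0f;
    // The loop bounds depend only on kernel arguments, so every thread in the
    // block reaches the barrier the same number of times (required for
    // __syncthreads() to be safe).
    for (int base = 0; base < n; base += tile) {
        int end = min(base + tile, n);
        for (int k = base; k < end; ++k)
            acc += table[k] * scale;
        if (SYNC) __syncthreads();
    }
    out[tid] = acc;
}

// Hypothetical helper: times one launch with CUDA events after a warm-up run.
template <bool SYNC>
float timeKernel(const float* table, float* out, int n, int tile, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    sumTable<SYNC><<<blocks, threads>>>(table, out, n, tile);  // warm-up
    cudaEventRecord(start);
    sumTable<SYNC><<<blocks, threads>>>(table, out, n, tile);  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    // Sizes chosen arbitrarily for the sketch.
    const int n = 1 << 20, tile = 1024, blocks = 64, threads = 256;
    float *table, *out;
    cudaMallocManaged(&table, n * sizeof(float));
    cudaMallocManaged(&out, blocks * threads * sizeof(float));
    for (int i = 0; i < n; ++i) table[i] = 1.0f;

    float msFree = timeKernel<false>(table, out, n, tile, blocks, threads);
    float msSync = timeKernel<true>(table, out, n, tile, blocks, threads);
    printf("without barrier: %.3f ms\nwith barrier:    %.3f ms\n", msFree, msSync);

    cudaFree(table);
    cudaFree(out);
    return 0;
}
```

Whether the barrier version comes out faster depends on the GPU, its cache sizes, and the access pattern, so treat the timings only as a starting point. Running both variants under a profiler such as Nsight Compute (for example `ncu ./sync_demo`, where the binary name is an assumption) and comparing the reported cache hit rates shows whether caching is actually what changed.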