CUDA synchronous and asynchronous memcpy

Is cudaMemcpy blocking? If so, should I use an async memory copy with a stream for better performance? And when do I need to sync the result: only once everything on the GPU is done and the result is handed back to the CPU? For example, would this order work: async copy -> fft -> mult -> ifft -> sync stream -> copy result back to host?

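Issue everything on the same stream and synchronize once, after the copy back has been queued: async copy -> fft -> mult -> ifft -> async copy result back -> cudaStreamSynchronize (or cudaStreamQuery).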

cudaStreamQuery does not block. If it returns cudaSuccess, all work in the stream has finished; if it returns cudaErrorNotReady, the stream is still busy.
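
The order above could look roughly like this. It is only a sketch under several assumptions that are not in the question: cuFFT does the transforms, the data is single-precision complex, the "mult" step is a pointwise multiply by a frequency-domain filter, and the names multiplyScale, hFilter, dFilter and the size n are made up for illustration. Host buffers are allocated with cudaMallocHost because cudaMemcpyAsync needs pinned host memory to actually overlap with CPU work. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical "mult" step: a[i] = a[i] * b[i] * scale (complex multiply).
__global__ void multiplyScale(cufftComplex* a, const cufftComplex* b,
                              int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex x = a[i], y = b[i];
        a[i].x = (x.x * y.x - x.y * y.y) * scale;
        a[i].y = (x.x * y.y + x.y * y.x) * scale;
    }
}

int main() {
    const int n = 1024;  // assumed signal length
    const size_t bytes = n * sizeof(cufftComplex);

    // Pinned host buffers so the async copies do not fall back to staged copies.
    cufftComplex *hIn, *hFilter, *hOut, *dData, *dFilter;
    cudaMallocHost((void**)&hIn, bytes);
    cudaMallocHost((void**)&hFilter, bytes);
    cudaMallocHost((void**)&hOut, bytes);
    cudaMalloc((void**)&dData, bytes);
    cudaMalloc((void**)&dFilter, bytes);
    for (int i = 0; i < n; ++i) {
        hIn[i].x = 1.0f;     hIn[i].y = 0.0f;      // dummy input
        hFilter[i].x = 1.0f; hFilter[i].y = 0.0f;  // identity filter
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, stream);  // make cuFFT run on our stream

    // Everything below is only queued; the calls return to the CPU right away.
    cudaMemcpyAsync(dData, hIn, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dFilter, hFilter, bytes, cudaMemcpyHostToDevice, stream);
    cufftExecC2C(plan, dData, dData, CUFFT_FORWARD);
    // cuFFT transforms are unnormalized, so scale by 1/n here.
    multiplyScale<<<(n + 255) / 256, 256, 0, stream>>>(dData, dFilter, n, 1.0f / n);
    cufftExecC2C(plan, dData, dData, CUFFT_INVERSE);
    cudaMemcpyAsync(hOut, dData, bytes, cudaMemcpyDeviceToHost, stream);

    // Option A: block until the stream is done.
    //   cudaStreamSynchronize(stream);
    // Option B: poll without blocking and do other CPU work in between.
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        // other CPU work could go here
    }
    if (status != cudaSuccess) {
        printf("stream failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    printf("hOut[0] = (%f, %f)\n", hOut[0].x, hOut[0].y);  // safe to read now

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(dData); cudaFree(dFilter);
    cudaFreeHost(hIn); cudaFreeHost(hFilter); cudaFreeHost(hOut);
    return 0;
}
```

No sync is needed between the stages because operations queued on one stream run in the order they were issued. The only point where the CPU has to wait (or poll) is before it reads hOut. This should build with something like nvcc example.cu -lcufft.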