Is cudaMemcpy blocking? If so, should I use cudaMemcpyAsync with a stream for better performance? And when do I need to synchronize: only once everything on the GPU is done and the result is handed back to the CPU? For example, would this sequence be correct: async copy -> FFT -> multiply -> IFFT -> stream sync -> copy result back to host?
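
To make the sequence concrete, here is a minimal sketch of what I have in mind. It assumes cuFFT for the transforms, pinned host buffers, and a single stream. The names runPipeline and multiplyAndScale are made up for illustration, and the filter is just all ones so the output should match the input.

```cuda
#include <cmath>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cuComplex.h>

#define CUDA_OK(call) do { if ((call) != cudaSuccess)   return -1.0f; } while (0)
#define FFT_OK(call)  do { if ((call) != CUFFT_SUCCESS) return -1.0f; } while (0)

// Hypothetical kernel: pointwise complex multiply. The 1/n scaling is folded
// in here because cuFFT transforms are unnormalized.
__global__ void multiplyAndScale(cufftComplex* a, const cufftComplex* b,
                                 int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex p = cuCmulf(a[i], b[i]);
        a[i] = make_cuFloatComplex(cuCrealf(p) * scale, cuCimagf(p) * scale);
    }
}

// Hypothetical driver: returns the max absolute error against the input,
// or -1.0f if any CUDA/cuFFT call fails (no cleanup on failure in this sketch).
float runPipeline(int n)
{
    size_t bytes = (size_t)n * sizeof(cufftComplex);
    cufftComplex *hIn = nullptr, *hFilt = nullptr, *hOut = nullptr;
    cufftComplex *dSig = nullptr, *dFilt = nullptr;

    // Pinned host memory, so the async copies can actually be asynchronous.
    CUDA_OK(cudaMallocHost((void**)&hIn, bytes));
    CUDA_OK(cudaMallocHost((void**)&hFilt, bytes));
    CUDA_OK(cudaMallocHost((void**)&hOut, bytes));
    CUDA_OK(cudaMalloc((void**)&dSig, bytes));
    CUDA_OK(cudaMalloc((void**)&dFilt, bytes));

    for (int i = 0; i < n; ++i) {
        hIn[i]   = make_cuFloatComplex(sinf(0.01f * i), 0.0f);
        hFilt[i] = make_cuFloatComplex(1.0f, 0.0f);  // identity filter (assumption)
    }

    cudaStream_t stream;
    CUDA_OK(cudaStreamCreate(&stream));
    cufftHandle plan;
    FFT_OK(cufftPlan1d(&plan, n, CUFFT_C2C, 1));
    FFT_OK(cufftSetStream(plan, stream));  // cuFFT work goes into the same stream

    // 1. async copy host -> device
    CUDA_OK(cudaMemcpyAsync(dSig, hIn, bytes, cudaMemcpyHostToDevice, stream));
    CUDA_OK(cudaMemcpyAsync(dFilt, hFilt, bytes, cudaMemcpyHostToDevice, stream));
    // 2. FFT (in place)
    FFT_OK(cufftExecC2C(plan, dSig, dSig, CUFFT_FORWARD));
    // 3. multiply
    int threads = 256, blocks = (n + threads - 1) / threads;
    multiplyAndScale<<<blocks, threads, 0, stream>>>(dSig, dFilt, n, 1.0f / n);
    CUDA_OK(cudaGetLastError());
    // 4. IFFT (in place)
    FFT_OK(cufftExecC2C(plan, dSig, dSig, CUFFT_INVERSE));
    // 5. sync the stream: the only point where the host waits
    CUDA_OK(cudaStreamSynchronize(stream));
    // 6. blocking copy of the result back to the host
    CUDA_OK(cudaMemcpy(hOut, dSig, bytes, cudaMemcpyDeviceToHost));

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) {
        maxErr = fmaxf(maxErr, fabsf(hOut[i].x - hIn[i].x));
        maxErr = fmaxf(maxErr, fabsf(hOut[i].y - hIn[i].y));
    }

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(dSig);  cudaFree(dFilt);
    cudaFreeHost(hIn);  cudaFreeHost(hFilt);  cudaFreeHost(hOut);
    return maxErr;
}
```

Calling runPipeline(1024) from main and checking that the returned error is small would exercise the whole chain. The host waits only at the single cudaStreamSynchronize. Whether the final copy could instead be a cudaMemcpyAsync followed by the sync is part of what I am asking.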