Why does multi-GPU cuFFT use cudaDeviceSynchronize() by default?

Hi, is there any function call that lets multi-GPU cuFFT use some form of stream synchronization instead of the automatic device synchronization? I want to overlap cuFFT computation with asynchronous H2D/D2H memory copies. Thanks.
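
For context, the single-GPU version of this overlap can be sketched roughly as below: the plan is bound to one stream with cufftSetStream, and an unrelated H2D copy is issued on a second stream. The size NX, the buffer names, and the function name run_overlap_sketch are placeholders made up for illustration, and whether the two operations actually overlap depends on the hardware. This is only a sketch of the pattern I am asking about, not a multi-GPU solution.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical transform length, chosen only for illustration.
static const int NX = 1 << 20;

// Returns 0 if every CUDA/cuFFT call succeeded, 1 otherwise.
int run_overlap_sketch() {
    const size_t bytes = sizeof(cufftComplex) * NX;
    cufftComplex *h_other = nullptr;  // pinned host buffer for the async copy
    cufftComplex *d_fft = nullptr;    // device buffer transformed in place
    cufftComplex *d_other = nullptr;  // device buffer receiving the copy
    cudaStream_t fftStream = nullptr, copyStream = nullptr;
    cufftHandle plan = 0;
    bool havePlan = false;

    // Pinned host memory is needed for cudaMemcpyAsync to be truly asynchronous.
    bool ok = cudaMallocHost(&h_other, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d_fft, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d_other, bytes) == cudaSuccess;
    ok = ok && cudaMemset(d_fft, 0, bytes) == cudaSuccess;
    if (ok) std::memset(h_other, 0, bytes);

    ok = ok && cudaStreamCreate(&fftStream) == cudaSuccess;
    ok = ok && cudaStreamCreate(&copyStream) == cudaSuccess;

    // Single-GPU plan, bound to its own stream.
    if (ok) {
        havePlan = cufftPlan1d(&plan, NX, CUFFT_C2C, 1) == CUFFT_SUCCESS;
        ok = havePlan;
    }
    ok = ok && cufftSetStream(plan, fftStream) == CUFFT_SUCCESS;

    // FFT is enqueued on fftStream; the copy is enqueued on copyStream.
    ok = ok && cufftExecC2C(plan, d_fft, d_fft, CUFFT_FORWARD) == CUFFT_SUCCESS;
    ok = ok && cudaMemcpyAsync(d_other, h_other, bytes,
                               cudaMemcpyHostToDevice, copyStream) == cudaSuccess;

    // Wait on each stream individually instead of cudaDeviceSynchronize().
    ok = ok && cudaStreamSynchronize(fftStream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(copyStream) == cudaSuccess;

    // Release whatever was created.
    if (havePlan) cufftDestroy(plan);
    if (fftStream) cudaStreamDestroy(fftStream);
    if (copyStream) cudaStreamDestroy(copyStream);
    if (d_other) cudaFree(d_other);
    if (d_fft) cudaFree(d_fft);
    if (h_other) cudaFreeHost(h_other);
    return ok ? 0 : 1;
}

int main() {
    int rc = run_overlap_sketch();
    std::printf("overlap sketch %s\n", rc == 0 ? "completed" : "failed");
    return rc;
}
```

This should build with something like `nvcc sketch.cu -lcufft` (file name hypothetical). What I am after is the equivalent of the two cudaStreamSynchronize calls for a multi-GPU plan.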