cuFFT's stream support

I was wondering if anyone could shed a little more light on the “undocumented and unsupported” cufftSetStream(cufftHandle, cudaStream_t) function.

Firstly, I assume it only needs to be called once per plan, straight after cufftPlan*().
Secondly, once cufftSetStream has been called for a plan, will calls to cufftExec*() on that plan be asynchronous, i.e., return control to the host immediately?
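
To make the two questions concrete, this is roughly the usage I have in mind. It is only a sketch: the transform length, batch count and buffer names are placeholders rather than my real code, and the cudaStreamQuery check is just my guess at how one could observe whether the exec call returned before the work finished.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int nx = 4096;    // placeholder transform length
    const int batch = 256;  // placeholder number of transforms per call

    // Device buffers: real input, complex output (nx/2 + 1 bins per R2C transform).
    cufftReal* d_in = nullptr;
    cufftComplex* d_out = nullptr;
    cudaMalloc((void**)&d_in, sizeof(cufftReal) * nx * batch);
    cudaMalloc((void**)&d_out, sizeof(cufftComplex) * (nx / 2 + 1) * batch);
    cudaMemset(d_in, 0, sizeof(cufftReal) * nx * batch);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Create the plan, then attach the stream once (my assumption in the first question).
    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_R2C, batch) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }
    cufftSetStream(plan, stream);

    // The second question: does this return before the transforms have finished?
    cufftExecR2C(plan, d_in, d_out);

    // cudaErrorNotReady here would mean work was still pending when the call returned.
    cudaError_t status = cudaStreamQuery(stream);
    printf("right after exec: %s\n",
           status == cudaErrorNotReady ? "still running" : cudaGetErrorString(status));

    cudaStreamSynchronize(stream);

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the FFT is short, the query could report success even when the call was asynchronous, so a "not still running" result from this check would not prove anything on its own.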

I’m trying to implement a pipeline in which data is copied to the GPU, a batch of R2C FFTs is performed on it, followed by a couple more operations and some inverse C2C FFTs, and the result is copied back to the host before the next lot of data is fetched. Performance so far is pretty good, but I’m seeing strange behaviour when I try to overlap the copies and the computation using streams. It almost appears as if the C2C transforms are overlapping properly but the R2C transforms are not. I may be doing or thinking something wrong, though.
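
For reference, the structure I am aiming for looks something like the sketch below. All sizes, the number of streams and chunks, and the choice of an in-place C2C transform of length nx/2 + 1 are assumptions made up for illustration; the "couple more operations" are left as a comment. Each stream gets its own buffers and its own pair of plans, so that one stream's copies can in principle overlap with another stream's transforms.

```cuda
#include <cstring>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    // Placeholder sizes, not my real problem dimensions.
    const int nx = 4096, batch = 64, nc = nx / 2 + 1;
    const int kStreams = 2, kChunks = 8;
    const size_t inBytes = sizeof(cufftReal) * nx * batch;
    const size_t outBytes = sizeof(cufftComplex) * nc * batch;

    cudaStream_t stream[kStreams];
    cufftHandle r2c[kStreams], c2c[kStreams];
    cufftReal *h_in[kStreams], *d_in[kStreams];
    cufftComplex *h_out[kStreams], *d_out[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        // Page-locked host memory, so the async copies can overlap with compute.
        cudaMallocHost((void**)&h_in[s], inBytes);
        cudaMallocHost((void**)&h_out[s], outBytes);
        memset(h_in[s], 0, inBytes);
        cudaMalloc((void**)&d_in[s], inBytes);
        cudaMalloc((void**)&d_out[s], outBytes);
        // One R2C and one C2C plan per stream, each bound to that stream once.
        cufftPlan1d(&r2c[s], nx, CUFFT_R2C, batch);
        cufftPlan1d(&c2c[s], nc, CUFFT_C2C, batch);
        cufftSetStream(r2c[s], stream[s]);
        cufftSetStream(c2c[s], stream[s]);
    }

    for (int chunk = 0; chunk < kChunks; ++chunk) {
        const int s = chunk % kStreams;
        // Wait for this stream's previous chunk before reusing its host buffers.
        cudaStreamSynchronize(stream[s]);
        // ... consume h_out[s] from the previous round, fill h_in[s] with new data ...
        cudaMemcpyAsync(d_in[s], h_in[s], inBytes, cudaMemcpyHostToDevice, stream[s]);
        cufftExecR2C(r2c[s], d_in[s], d_out[s]);
        // ... the other operations would be launched into stream[s] here ...
        cufftExecC2C(c2c[s], d_out[s], d_out[s], CUFFT_INVERSE);
        cudaMemcpyAsync(h_out[s], d_out[s], outBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) {
        cufftDestroy(r2c[s]);
        cufftDestroy(c2c[s]);
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaFreeHost(h_in[s]);
        cudaFreeHost(h_out[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

Error checking is omitted to keep the sketch short. As far as I understand, the host buffers need to be page-locked for cudaMemcpyAsync to overlap with anything, which is why cudaMallocHost is used here.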

Has anyone else used this new functionality, or can anyone explain a bit more about it?