Are the cuFFT library calls asynchronous?


Are the cuFFT calls asynchronous? I have an iterative process in which a function update

__host__ void update(cufftDoubleReal *dbbff,cufftDoubleReal *dppsi, double *ddqq,cufftDoubleReal *hbbff,cufftDoubleReal *hppsi,int llx,int lly,int totsize,int totsize_pad,int totsize_invspa,double rr,const double q0,double ddt,cufftHandle pprc,cufftHandle ppcr,dim3 ggrid, dim3 tthreads)
    nonlinterm<<<ggrid,tthreads>>>(dbbff,dppsi, totsize_pad);
    kupdt<<<ggrid,tthreads>>>((cufftDoubleComplex*)dppsi,(cufftDoubleComplex*)dbbff,ddqq,totsize_invspa,totsize,rr,ddt,q0);

is called nout times. At that point I copy the data to the CPU to check for convergence. I noticed that calling the function nout*10 times takes the same time as calling it (nout/10)x100 times. This leads me to believe that the cuFFT calls are blocking. Is this right? Can the calls be made asynchronous? For iterative processes I think this might improve performance.
(I tried to look in the manual, but it only mentions streams in this context. I have only one stream.)
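One way to check this directly is to compare how long the host spends issuing the transforms with how long the GPU takes to finish them. The following is only a minimal sketch, not my actual code: the function name `time_fft_enqueue`, the grid size and the iteration count are made up for illustration, and it uses a plain out-of-place `cufftExecD2Z` on the default stream. If the enqueue time comes out much smaller than the total time, the calls returned before the work was done, which would mean they are not blocking. A large iteration count may fill the driver's launch queue and make later calls wait, so the numbers should be read with that in mind.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Milliseconds elapsed on the host since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

// Hypothetical helper: issues `niter` D2Z transforms of size nx*ny and reports
// the host time spent issuing them (enqueue_ms) and the time until the GPU has
// finished all of them (total_ms). Returns 0 on success, 1 on any error.
int time_fft_enqueue(int nx, int ny, int niter, double *enqueue_ms, double *total_ms) {
    cufftDoubleReal *d_in = nullptr;
    cufftDoubleComplex *d_out = nullptr;
    size_t in_bytes = sizeof(cufftDoubleReal) * nx * ny;
    size_t out_bytes = sizeof(cufftDoubleComplex) * nx * (ny / 2 + 1);
    if (cudaMalloc(&d_in, in_bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemset(d_in, 0, in_bytes);

    cufftHandle plan;
    if (cufftPlan2d(&plan, nx, ny, CUFFT_D2Z) != CUFFT_SUCCESS) {
        cudaFree(d_in); cudaFree(d_out); return 1;
    }

    // Warm-up call so one-time initialisation is not included in the timing.
    cufftExecD2Z(plan, d_in, d_out);
    cudaDeviceSynchronize();

    int status = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < niter; ++i) {
        if (cufftExecD2Z(plan, d_in, d_out) != CUFFT_SUCCESS) status = 1;
    }
    *enqueue_ms = ms_since(t0);   // host time spent inside the cuFFT calls
    if (cudaDeviceSynchronize() != cudaSuccess) status = 1;
    *total_ms = ms_since(t0);     // time until all transforms have completed

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    return status;
}

int main() {
    double enqueue_ms = 0.0, total_ms = 0.0;
    // 1024 x 1024 and 50 iterations are arbitrary illustration values.
    if (time_fft_enqueue(1024, 1024, 50, &enqueue_ms, &total_ms) != 0) {
        fprintf(stderr, "CUDA or cuFFT call failed\n");
        return 1;
    }
    printf("enqueue: %.3f ms, total: %.3f ms\n", enqueue_ms, total_ms);
    return 0;
}
```

This should build with something like `nvcc timing.cu -lcufft` (the file name is arbitrary). The same measurement could be wrapped around the whole update function to see whether the kernels or the transforms dominate.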

Here is the call sequence in the main function:

for(int sss=1;sss<=nsteps;sss++)
    for(int n=1;n <= nend;n++)
    CUDA_CHECK( cudaMemcpy(hpsi, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) );    
// Start of energy function    
   printf("%26.20lf %26.20lf\n",ene[sss],(r+pow(q0,4))*pm*pm/2.0+pm*pm*pm*pm/4.0); 
   fprintf(pFile,"%d %26.20lf %26.20lf\n",sss,ene[sss],(r+pow(q0,4))*pm*pm/2.0+pm*pm*pm*pm/4.0); 
// some simple CPU stuff + saving data
   printf("%d %d\n",sss,cpart);

is called nout times,