Asynchronicity in CUDA 2.0

I’ve just upgraded my machine to CUDA 2.0 (from the beta), and I’m wondering if a bunch of calls have quietly been made asynchronous. For example, I have a routine that rearranges some data, calls CUFFT, and then arranges the data back. All of a sudden, my timers (using CUT) are claiming that the FFT is taking less than half a millisecond, when it used to take 60 ms or so. Curiously, though, the final arrangement is now taking an extra 60 ms. The code is as follows:

CUT_SAFE_CALL( cutStartTimer( t_csk1 ) );
AggregateKernel<<<inGridDim,blockDim>>>( uSize, vSize,
          -uSize/2, -vSize/2,
          (cuComplex*)ImagerWorkspace );
CUT_SAFE_CALL( cutStopTimer( t_csk1 ) );
CUT_CHECK_ERROR("CSK1 execution failed\n");

CUT_SAFE_CALL( cutStartTimer( t_fft ) );
cuErr = cufftExecC2R( plan,
     (float *)ImagerWorkspace );
CUT_SAFE_CALL( cutStopTimer( t_fft ) );
if( cuErr != CUFFT_SUCCESS ) {
      printf( "FFT Failed on polarisation %d\n", i );
      return( -1 );
}

CUT_SAFE_CALL( cutStartTimer( t_csk2 ) );
<<<outGridDim,blockDim>>>( outw, outh, 2*inh,
     -outw/2, -1-outh/2,
     normalisation, (float *)ImagerWorkspace,
     &d_imageData[i*uSize*vSize] );
CUT_SAFE_CALL( cutStopTimer( t_csk2 ) );
CUT_CHECK_ERROR("CSK2 execution failed\n");

The time recorded by t_fft is almost zero, while the time reported by t_csk2 has increased by the former value of t_fft.

In another routine, I do a bunch of cudaMemcpys (not Async), and the time for those has suddenly dramatically reduced, but the balance of time appears to show up later in the routine.

Not quietly; it’s documented in the programming guide, in the subsection “All kernel launches are asynchronous”.
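Because the launch returns immediately, a host-side timer started before and stopped after it only measures the cost of enqueueing the work. If you want per-kernel times without forcing a full synchronisation, CUDA events are the way to go — they’re recorded in the GPU’s own stream, so the interval covers the kernel’s actual execution. A minimal, self-contained sketch (the kernel and sizes here are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel that does enough work to take measurable time.
__global__ void BusyKernel( float *data, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        for( int k = 0; k < 10000; ++k )
            data[i] = data[i] * 0.999f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc( (void**)&d_data, n * sizeof(float) );

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    // The events are queued in-stream around the kernel, so the
    // elapsed time covers the kernel itself, even though the
    // launch call below returns to the host almost immediately.
    cudaEventRecord( start, 0 );
    BusyKernel<<< n/256, 256 >>>( d_data, n );
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );   // wait for the stop event to be reached

    float ms;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "kernel time: %f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( d_data );
    return 0;
}
```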

You may have noticed a difference in going to CUDA 2.0, as the queue depth for asynchronous launches has gone up from ~16 to ~150.
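You can see the queue depth for yourself with something like the sketch below (kernel and loop count invented for illustration): the first ~queue-depth launches return almost immediately, and once the queue is full each further launch blocks until a slot frees up, so the launch loop itself starts absorbing kernel execution time.

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Toy kernel that burns enough time for the launch queue to back up.
__global__ void SpinKernel( float *p )
{
    for( int k = 0; k < 100000; ++k )
        p[threadIdx.x] += 1.0f;
}

int main()
{
    float *d_p;
    cudaMalloc( (void**)&d_p, 256 * sizeof(float) );

    clock_t t0 = clock();
    for( int i = 0; i < 300; ++i )   // more launches than the queue holds
        SpinKernel<<< 1, 256 >>>( d_p );
    clock_t t1 = clock();            // all 300 launches issued

    cudaThreadSynchronize();         // now actually wait for them to finish
    clock_t t2 = clock();

    // With a ~16-deep queue, most of the kernel time shows up in the
    // launch loop; with a ~150-deep queue, far more of it moves to the
    // drain phase after cudaThreadSynchronize().
    printf( "launch loop: %.1f ms, drain: %.1f ms\n",
            1000.0 * (t1 - t0) / CLOCKS_PER_SEC,
            1000.0 * (t2 - t1) / CLOCKS_PER_SEC );

    cudaFree( d_p );
    return 0;
}
```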

You gave me an idea… adding a cudaThreadSynchronize() after the cufftExecC2R call suddenly made the reported times sensible again. It must be that cufftExec simply dispatches a lot of kernels but doesn’t wait for any of them to complete. Assuming that it launches more than 16 (and I know from cudaprof that it launches a lot), the behaviour I observed would be consistent with the increase: before, cufftExec had to wait for some of its kernels to complete before more could be launched (or rather, queued); now, it can queue them all and return quickly. This does mean that I was previously victimising the wrong routine for being slow… :whistling:
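For anyone else who hits this, the change was just one line in the timing section from my first post — cudaThreadSynchronize() blocks the host until everything queued so far has finished, so the timer now covers the FFT’s kernels rather than just the time to queue them:

```cuda
CUT_SAFE_CALL( cutStartTimer( t_fft ) );
cuErr = cufftExecC2R( plan,
     (float *)ImagerWorkspace );
cudaThreadSynchronize();   // wait for every kernel cufftExec queued to complete
CUT_SAFE_CALL( cutStopTimer( t_fft ) );
```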