I’ve just upgraded my machine to CUDA 2.0 (from the beta), and I’m wondering if a bunch a calls have quietly been made asynchronous? For example, I have a routine re-arrange some data, call CUFFT, and then arrange the data back. All of a sudden, my timers (using CUT) are claiming that the FFT is taking less than half a millisecond, when it used to take 60ms or so. Curiously though, the final arrangement is now taking an extra 60 ms. The code is as follows
CUT_SAFE_CALL( cutStartTimer( t_csk1 ) );
AggregateKernel<<<inGridDim,blockDim>>>( uSize, vSize,
-uSize/2, -vSize/2,
d_uvCurr,
(cuComplex*)ImagerWorkspace );
cudaThreadSynchronize();
CUT_SAFE_CALL( cutStopTimer( t_csk1 ) );
CUT_CHECK_ERROR("CSK1 execution failed\n");
CUT_SAFE_CALL( cutStartTimer( t_fft ) );
cuErr = cufftExecC2R( plan,
(cufftComplex*)ImagerWorkspace,
(float *)ImagerWorkspace );
CUT_SAFE_CALL( cutStopTimer( t_fft ) );
if( cuErr != CUFFT_SUCCESS ) {
printf( "FFT Failed on polarisation %d\n", i );
return( -1 );
}
CUT_SAFE_CALL( cutStartTimer( t_csk2 ) );
CircularShiftKernel<true,false,true>
<<<outGridDim,blockDim>>>( outw, outh, 2*inh,
-outw/2, -1-outh/2,
normalisation, (float *)ImagerWorkspace,
&d_imageData[i*uSize*vSize] );
cudaThreadSynchronize();
CUT_SAFE_CALL( cutStopTimer( t_csk2 ) );
CUT_CHECK_ERROR("CSK2 execution failed\n");
The time recorded by t_fft is amost zero, while the time reported by t_csk2 has increased by the former value of t_fft.
In another routine, I do a bunch of cudaMemcpys (not Async), and the time for those has suddenly dramatically reduced, but the balance of time appears to show up later in the routine.