I am running a sequence of convolutions to simulate wave propagation in a medium. Running the profiler, I find cputime is more than twice gputime for one convolution (201us vs. 408us).
Assuming the difference is due to overhead, is there a device-callable FFT in CUFFT I could use to create a single kernel? Or is there a way to use the kernels inside CUFFT to reduce the cputime?
I run several thousand convolutions at a time and over 1.5 million during the entire run of my algorithm, so the performance boost would be significant. I’d hate to write my own 2D FFT from scratch.
If you are operating on device pointer, you should not see all this overhead.
This is a code that does 2D FFT, something on the data, 2D iFFT.
rhs_complex is a device pointer.
/* Compute FFT : from physical to Fourier space */
cufftExecute(plan, rhs_complex_d,rhs_complex_d, CUFFT_FORWARD);
/* Solve the Poisson equation in Fourier space */
solve_poisson<<<dimGrid,dimBlock>>>(rhs_complex_d,kx_d,ky_d,N);
/* Compute FFT : from Fourier to physical space */
cufftExecute(plan, rhs_complex_d,rhs_complex_d, CUFFT_INVERSE);
/* Copy the solution back to a real array and apply scaling */
complex2real_scaled<<<dimGrid,dimBlock>>>(rhs_complex_d,rhs_real_d,N,scale);
This is the output from the profiler:
As you can see, the cputime and gputime are very similar.
I am expecting that for a convolution, you should see something similar
As you can see there is anywhere from 15-17us overhead for each kernel call (same as yours), but it’s roughly 33%-66% of the total time. Ugh.
The matricies d_field, d_new_field, and d_prop are all 256x64. Do you think that size just isn’t big enough to see good performance relative to kernel call overhead?
I have multiple independent convolutions to compute; too bad CUFFT doesn’t have batching for 2D transforms. Is that coming in a future release? I suppose I could use a combination of batched 1D FFTs and tranpose to get the same effect.
This approach is working great so far. Using a “naive” transpose implementation I can get 3-4x the performance for several hundred 256x64 FFTs. I expect even better performance as I scale up and once I finally get the transpose from the SDK samples working for multiple matrices. Thanks.