Device-callable FFT?

I am running a sequence of convolutions to simulate wave propagation in a medium. Running the profiler, I find that cputime is more than twice gputime for one convolution (gputime 201 us vs. cputime 408 us).

Assuming the difference is due to overhead, is there a device-callable FFT in CUFFT I could use to create a single kernel? Or is there a way to use the kernels inside CUFFT to reduce the cputime?

I run several thousand convolutions at a time and over 1.5 million during the entire run of my algorithm, so the performance boost would be significant. I’d hate to write my own 2D FFT from scratch.

Thanks in advance for any ideas.

If you are operating on device pointers, you should not see all this overhead.

Here is code that does a 2D FFT, operates on the data in Fourier space, then does a 2D inverse FFT.

rhs_complex_d is a device pointer.

/* Compute FFT: from physical to Fourier space */
   cufftExecute(plan, rhs_complex_d, rhs_complex_d, CUFFT_FORWARD);

/* Solve the Poisson equation in Fourier space */
   solve_poisson<<<dimGrid, dimBlock>>>(rhs_complex_d, kx_d, ky_d, N);

/* Compute FFT: from Fourier to physical space */
   cufftExecute(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE);

/* Copy the solution back to a real array and apply scaling */
   complex2real_scaled<<<dimGrid, dimBlock>>>(rhs_complex_d, rhs_real_d, N, scale);
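
The two custom kernels are short. Simplified versions look roughly like this (illustrative sketch only: an N x N grid, cufftComplex data, and wavenumbers precomputed in kx_d and ky_d are assumed):

/* Illustrative sketch only -- not the exact kernel bodies.
   Assumes an N x N grid, cufftComplex (float2) data, and device arrays
   kx_d, ky_d holding the wavenumber along each dimension. */

__global__ void solve_poisson(cufftComplex *rhs, float *kx, float *ky, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < N && j < N) {
        int   idx = j * N + i;
        float k2  = kx[i] * kx[i] + ky[j] * ky[j];

        /* Divide by -k^2 in Fourier space; leave the zero mode alone */
        float factor = (k2 > 0.0f) ? -1.0f / k2 : 0.0f;
        rhs[idx].x *= factor;
        rhs[idx].y *= factor;
    }
}

__global__ void complex2real_scaled(cufftComplex *in, float *out, int N, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < N && j < N) {
        int idx = j * N + i;
        /* Keep the real part and apply the 1/(N*N) normalization */
        out[idx] = scale * in[idx].x;
    }
}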

This is the output from the profiler:

As you can see, the cputime and gputime are very similar.

I would expect to see something similar for a convolution.

Massimiliano

Thanks for the reply, mfatica. Here is my code:

// Transform field
cufftExecute(plan, d_field, d_new_field, CUFFT_FORWARD);

// Multiply
ComplexPointwiseMul<<<num_blocks, THREADS_PER_BLOCK>>>(d_new_field, d_prop);

// Transform field back
cufftExecute(plan, d_new_field, d_new_field, CUFFT_INVERSE);

The pointers d_field, d_new_field, and d_prop are all on the device. I’ve pre-computed the FFT of d_prop to save some time.
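
ComplexPointwiseMul is nothing fancy, just the per-element complex product. A simplified sketch (SIZE here stands in for the total element count, and the 1/(N*M) normalization needed after the inverse transform is not shown):

/* Simplified sketch; SIZE stands in for the total element count (256*64).
   The 1/SIZE normalization after the inverse transform is not shown. */
#define SIZE (256 * 64)

__global__ void ComplexPointwiseMul(cufftComplex *a, const cufftComplex *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < SIZE) {
        cufftComplex va = a[i];
        cufftComplex vb = b[i];
        cufftComplex c;
        c.x = va.x * vb.x - va.y * vb.y;   /* real part      */
        c.y = va.x * vb.y + va.y * vb.x;   /* imaginary part */
        a[i] = c;
    }
}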

And the profiler output:

method=[ c2c_radix4_sp ] gputime=[ 29.856 ] cputime=[ 46.000 ] occupancy=[ 0.208 ]

method=[ c2c_transpose ] gputime=[ 8.288 ] cputime=[ 24.000 ] occupancy=[ 0.333 ]

method=[ c2c_radix4_sp ] gputime=[ 17.376 ] cputime=[ 33.000 ] occupancy=[ 0.417 ]

method=[ c2c_transpose ] gputime=[ 7.776 ] cputime=[ 24.000 ] occupancy=[ 0.333 ]

method=[ ComplexPointwiseMul ] gputime=[ 7.520 ] cputime=[ 24.000 ] occupancy=[ 1.000 ]

method=[ c2c_radix4_sp ] gputime=[ 33.824 ] cputime=[ 49.000 ] occupancy=[ 0.208 ]

method=[ c2c_transpose ] gputime=[ 7.776 ] cputime=[ 24.000 ] occupancy=[ 0.333 ]

method=[ c2c_radix4_sp ] gputime=[ 17.280 ] cputime=[ 33.000 ] occupancy=[ 0.417 ]

method=[ c2c_transpose ] gputime=[ 7.744 ] cputime=[ 24.000 ] occupancy=[ 0.333 ]

As you can see, there is anywhere from 15-17 us of overhead for each kernel call (same as yours), but here that is roughly 33-66% of the total time. Ugh.

The matrices d_field, d_new_field, and d_prop are all 256x64. Do you think that size just isn’t big enough to see good performance relative to kernel call overhead?

I have multiple independent convolutions to compute; too bad CUFFT doesn’t have batching for 2D transforms. Is that coming in a future release? I suppose I could use a combination of batched 1D FFTs and transposes to get the same effect.

256x64 is kind of small.

Try the 1D batch mode together with the transpose code that is in the new SDK examples.

For the 2D batch mode, we will take the request under consideration.

Massimiliano

This approach is working great so far. Using a “naive” transpose implementation I can get 3-4x the performance for several hundred 256x64 FFTs. I expect even better performance as I scale up and once I finally get the transpose from the SDK samples working for multiple matrices. Thanks.
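
In case it helps anyone else, the gist of it looks roughly like this (a schematic sketch for a single 256x64 matrix, using the same cufftExecute-style calls as above; the layout and plan sizes are illustrative, and with many matrices the batch counts simply grow):

/* Schematic sketch of "2D FFT via batched 1D FFTs + naive transpose" for one
   256x64 matrix (256 rows of 64 complex values, row-major).  The layout and
   plan sizes here are assumptions for illustration; with many matrices the
   batch counts grow and each matrix gets its own transpose. */

#include <cufft.h>

__global__ void transpose_naive(cufftComplex *out, const cufftComplex *in,
                                int width, int height)   /* width/height of the input */
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}

void fft2d_via_batched_1d(cufftComplex *d_data, cufftComplex *d_tmp)
{
    cufftHandle planRows, planCols;
    cufftPlan1d(&planRows, 64,  CUFFT_C2C, 256);  /* 256 FFTs along the length-64 rows    */
    cufftPlan1d(&planCols, 256, CUFFT_C2C, 64);   /* 64 FFTs along the length-256 columns */

    dim3 block(16, 16);
    dim3 gridA((64 + 15) / 16, (256 + 15) / 16);  /* covers the 256x64 input */
    dim3 gridB((256 + 15) / 16, (64 + 15) / 16);  /* covers the 64x256 input */

    /* FFT the rows, transpose so the columns become contiguous,
       FFT those, then transpose back. */
    cufftExecute(planRows, d_data, d_data, CUFFT_FORWARD);
    transpose_naive<<<gridA, block>>>(d_tmp, d_data, 64, 256);
    cufftExecute(planCols, d_tmp, d_tmp, CUFFT_FORWARD);
    transpose_naive<<<gridB, block>>>(d_data, d_tmp, 256, 64);

    cufftDestroy(planRows);
    cufftDestroy(planCols);
}

The naive transpose is uncoalesced on one side, which is why the shared-memory version from the SDK samples should help further once I have it working for multiple matrices.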

Jim