cufftExecute() overhead.

Hi,

Following the advice of mfatica here http://forums.nvidia.com/index.php?showtopic=37035, I wrote a simple program to compare the cufft batch mode against a more explicit loop. My goal is to do some processing between each of the FFTs:

fft -> some simple processing -> fft -> some simple processing … so on …

One way to do this is to keep the FFT output in device memory, apply a kernel to this result, run another FFT, and so on. But I worried about the overhead associated with each call to cufft vs. a single cufft batch. Unfortunately, my worry was justified; the following code:

...

for (pass = 0; pass < nb_pass; pass++)
{
    // Transform signal
    CUFFT_SAFE_CALL(cufftExecute(plan, d_signal, d_signal, CUFFT_FORWARD));

    //// Simple proc. here ////

    // Transform signal back
    CUFFT_SAFE_CALL(cufftExecute(plan, d_signal, d_signal, CUFFT_INVERSE));
}

...

takes around 2980 ms to execute (4096 x 16384-point FFTs). In batch mode the execution takes around 650 ms (about 4.5x faster). This overhead makes the performance on par with FFTW on my CPU (Athlon64 2.6 GHz). One way to avoid it would be to include all the processing in a single kernel, but to do that I would have to rewrite the FFT code. I wonder if there is some way to call the cufft code at a lower level (from another kernel)?

Thanks !

Edgardz

Have you looked at the profiler output?
What are you doing in the simple computation between the cufftExecute calls?

Massimiliano

Hi Massimiliano,

In fact, my goal is to do some processing in the frequency domain, but here I just wanted to measure the overhead associated with a loop of cufftExecute() calls vs. the batch mode, so there is no processing between the FFT and the IFFT.

I have the profile of each case. In the “explicit loop” version I get essentially:

method=[ memcopy ] gputime=[ 97.504 ]
method=[ c2c_radix2_mpsm ] gputime=[ 136.032 ] cputime=[ 174.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 20.864 ] cputime=[ 57.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 18.528 ] cputime=[ 54.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 18.176 ] cputime=[ 54.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 21.280 ] cputime=[ 57.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpsm ] gputime=[ 89.920 ] cputime=[ 126.000 ] occupancy=[ 0.667 ]
... and so on (40960 lines!)
method=[ memcopy ] gputime=[ 97.088 ]

and the batch mode is:

method=[ memcopy ] gputime=[ 97.408 ]
method=[ c2c_radix2_mpsm ] gputime=[ 210203.875 ] cputime=[ 210229.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 45560.895 ] cputime=[ 45594.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43704.832 ] cputime=[ 43738.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43789.793 ] cputime=[ 43823.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43870.977 ] cputime=[ 43904.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpsm ] gputime=[ 210092.031 ] cputime=[ 211093.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 45500.703 ] cputime=[ 45947.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43714.207 ] cputime=[ 43747.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43815.711 ] cputime=[ 43849.000 ] occupancy=[ 0.667 ]
method=[ c2c_radix2_mpgm ] gputime=[ 43875.137 ] cputime=[ 43909.000 ] occupancy=[ 0.667 ]
method=[ memcopy ] gputime=[ 96.320 ]

Edgardz

Are you comparing a 2D transform time with a batch of 1D?

The launch overhead has been reduced in the new release.

I was wondering about your simple computation.
BTW, if you transform back and forth, the result is not scaled back. You can lump the scaling in your simple computation.
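A minimal sketch of what lumping the scaling into the in-between kernel could look like (the kernel name and the placeholder processing are illustrative assumptions, not part of the CUFFT API):

```cuda
#include <cufft.h>

// Hypothetical pointwise kernel: does the frequency-domain processing and
// folds in the 1/N scaling that the unnormalized inverse transform omits.
__global__ void process_and_scale(cufftComplex *d_signal, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex v = d_signal[i];
        // ... frequency-domain processing would go here ...
        d_signal[i].x = v.x * scale;  // e.g. scale = 1.0f / SIGNAL_SIZE
        d_signal[i].y = v.y * scale;
    }
}
```

Launched between the forward and inverse cufftExecute() calls, this avoids a separate pass over the data just for normalization.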

The plan for the “explicit loop” version is:

cufftHandle plan;

CUFFT_SAFE_CALL(cufftPlan1d(&plan, SIGNAL_SIZE, CUFFT_DATA_C2C, 1));

and the batch version is

cufftHandle plan;

CUFFT_SAFE_CALL(cufftPlan1d(&plan, SIGNAL_SIZE, CUFFT_DATA_C2C, nb_pass));

where nb_pass = 4096. They are both 1D complex transforms. I will try the same code as soon as the release is out. Is it coming soon?

Edgardz

Wow! I have just run the same benchmark on CUDA 0.9 and the cufftExecute() overhead is much lower. The “explicit loop” version now runs in approx. 880 ms. Great job, folks!

I have a similar desire. For example, if I want to do a complex multiply in the frequency domain, or simply scale the FFT by the number of pixels (for forward/backward pairs), I have to read the first FFT’s output from global memory (along with the frequency modifier), do the multiplication, then write it back to global memory before calling the second FFT.

It’s quite a waste of global memory fetches to do a couple of multiplies and adds in between. It would be nice if the FFT calls included an option (like the CUBLAS functions) to multiply the outputs (or inputs) by an array or a constant value, so those math operations happen while the data is still in local memory.
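For the complex-multiply case, the in-between pass currently has to look something like the sketch below (the kernel name and the assumption of a precomputed frequency-domain filter d_filter are illustrative):

```cuda
#include <cufft.h>

// Hypothetical kernel: pointwise complex multiply of the FFT output by a
// frequency-domain filter, with the 1/N scale for a forward/inverse pair
// folded into the same pass so the data is only touched once.
__global__ void complex_mul_scale(cufftComplex *d_data,
                                  const cufftComplex *d_filter,
                                  int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex a = d_data[i];
        cufftComplex b = d_filter[i];
        // (a.x + i*a.y) * (b.x + i*b.y), then scale
        d_data[i].x = (a.x * b.x - a.y * b.y) * scale;
        d_data[i].y = (a.x * b.y + a.y * b.x) * scale;
    }
}
```

Each element is read once and written once here, but it is still an extra round trip through global memory compared to doing the multiply inside the FFT kernels themselves.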

A good example of this is in optical processing. To generate an optical transfer function (OTF) from a complex pupil function (CPF), an FFT is taken of the CPF, then its magnitude squared is calculated to get the incoherent point spread function (PSF), and an FFT of the result is taken to finally yield the OTF.
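The middle step of that pipeline, the squared magnitude of the first FFT’s output, could be sketched as follows (the kernel name is made up for illustration):

```cuda
#include <cufft.h>

// Hypothetical kernel for the CPF -> OTF pipeline: replaces each complex
// value with its squared magnitude, yielding the real-valued incoherent PSF.
__global__ void magnitude_squared(cufftComplex *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float m2 = d[i].x * d[i].x + d[i].y * d[i].y;  // |z|^2 = re^2 + im^2
        d[i].x = m2;    // PSF is real
        d[i].y = 0.0f;
    }
}
```

The full sequence would then be: cufftExecute (forward) -> magnitude_squared -> cufftExecute (forward) to obtain the OTF, with two global-memory round trips that the requested fused option would eliminate.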

I suppose one could categorize this as a library function addition/modification request. Is that possible?