1D FFT 3x faster than 2D FFT?

I've been experimenting with FFTs on 512x512 complex matrices because I need to save some time, and I ran into some strange behavior. The FFT is separable (and I believe that is how most implementations compute a 2D FFT), so running a 1D FFT first along X and then along Y should give a 2D FFT. But when I timed both options, the 2D FFT was 3 times slower than a single 1D pass, rather than the 2 times I expected (to account for X and Y). Is this behavior expected, or am I missing something?
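To make the separability claim concrete, here is a minimal CPU-side sketch in plain C. It is not cuFFT and not an FFT: it uses a naive DFT on a tiny 4x4 matrix, chosen so that the twiddle factors are exactly 1, -i, -1 and i and integer arithmetic stays exact. All names (cint, twiddle, dft1d and so on) are hypothetical and only serve the illustration.

```c
#include <assert.h>

#define N 4  /* hypothetical tiny size: 4-point twiddles are exactly 1, -i, -1, i */

typedef struct { int re, im; } cint;  /* complex number with integer parts */

/* Multiply z by exp(-2*pi*i*p/4) = (-i)^p, which is exact when N == 4. */
static cint twiddle(cint z, int p)
{
    cint r;
    switch (p & 3) {
    case 0:  r.re =  z.re; r.im =  z.im; break;  /* times  1 */
    case 1:  r.re =  z.im; r.im = -z.re; break;  /* times -i */
    case 2:  r.re = -z.re; r.im = -z.im; break;  /* times -1 */
    default: r.re = -z.im; r.im =  z.re; break;  /* times  i */
    }
    return r;
}

/* Naive in-place 1D DFT of N elements spaced `stride` apart. */
static void dft1d(cint *data, int stride)
{
    cint out[N];
    assert(stride >= 1);
    for (int k = 0; k < N; k++) {
        out[k].re = out[k].im = 0;
        for (int j = 0; j < N; j++) {
            cint t = twiddle(data[j * stride], k * j);
            out[k].re += t.re;
            out[k].im += t.im;
        }
    }
    for (int k = 0; k < N; k++)
        data[k * stride] = out[k];
}

/* 2D DFT as a pass over rows followed by a pass over columns (row-major). */
static void dft2d_separable(cint *a)
{
    for (int y = 0; y < N; y++) dft1d(a + y * N, 1);  /* X pass: contiguous rows */
    for (int x = 0; x < N; x++) dft1d(a + x, N);      /* Y pass: strided columns */
}

/* 2D DFT straight from the definition, used as the reference. */
static void dft2d_direct(const cint *in, cint *out)
{
    for (int v = 0; v < N; v++)
        for (int u = 0; u < N; u++) {
            cint s = {0, 0};
            for (int y = 0; y < N; y++)
                for (int x = 0; x < N; x++) {
                    cint t = twiddle(in[y * N + x], u * x + v * y);
                    s.re += t.re;
                    s.im += t.im;
                }
            out[v * N + u] = s;
        }
}

/* Returns 1 if the separable and direct results agree on a fixed test matrix. */
static int separable_matches_direct(void)
{
    cint a[N * N], ref[N * N];
    for (int i = 0; i < N * N; i++) { a[i].re = i % 5; a[i].im = 3 - i % 3; }
    dft2d_direct(a, ref);
    dft2d_separable(a);
    for (int i = 0; i < N * N; i++)
        if (a[i].re != ref[i].re || a[i].im != ref[i].im) return 0;
    return 1;
}
```

Calling separable_matches_direct() compares the two results element by element. Note that the Y pass only works because it walks the data with a stride of N; that stride is exactly what the batched 1D plan has to express.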

// 2D plan
cufftHandle plan;
CUFFTERR(cufftPlan2d(&plan, WIDTH, HEIGHT, CUFFT_C2C));

// Batched 1D plan along Y: intended stride WIDTH, distance 1, WIDTH batches
cufftHandle plany;
int size = HEIGHT;
CUFFTERR(cufftPlanMany(&plany, 1, &size, NULL, WIDTH, 1, NULL, WIDTH, 1, CUFFT_C2C, WIDTH));

// Batched 1D plan along X: stride 1, distance WIDTH, HEIGHT batches
cufftHandle planx;
size = WIDTH;
CUFFTERR(cufftPlanMany(&planx, 1, &size, NULL, 1, WIDTH, NULL, 1, WIDTH, CUFFT_C2C, HEIGHT));

// Timed loop 1: 2D transform
for (int i = 0 ; i < NITER ; i++)
	CUFFTERR(cufftExecC2C(plan, (cufftComplex *)device, (cufftComplex *)device, CUFFT_FORWARD));

// Timed loop 2: 1D transforms along X
for (int i = 0 ; i < NITER ; i++)
	CUFFTERR(cufftExecC2C(planx, (cufftComplex *)device, (cufftComplex *)device, CUFFT_FORWARD));

// Timed loop 3: 1D transforms along Y
for (int i = 0 ; i < NITER ; i++)
	CUFFTERR(cufftExecC2C(plany, (cufftComplex *)device, (cufftComplex *)device, CUFFT_FORWARD));

The 1D transforms (the planx and plany loops) take 10.4 ms each for 272 iterations.

The 2D transform (the plan loop) takes 29.2 ms.

If you pass NULL for inembed and onembed in your plany, the stride and distance arguments that follow (WIDTH and 1) are ignored and the basic contiguous layout is assumed.
So your code is not correct: it performs FFTs on contiguous data twice rather than a 2D FFT, and that is why it is faster.

Since the transform is 1D, any non-NULL value will work, because inembed[0] is never used.

From the manual:
The following equations illustrate how these parameters are used to calculate the index for each element in the input or output array:
b = 0 … batch − 1
x = 0 … n[0] − 1
y = 0 … n[1] − 1
z = 0 … n[2] − 1
1D:
output_index = b ∗ odist + x ∗ ostride
input_index = b ∗ idist + x ∗ istride
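As a rough illustration of the 1D formula, the following plain C sketch computes which element each (batch, x) pair addresses for the X pass, for the Y pass, and for the contiguous layout that applies when inembed and onembed are NULL. The helper names are hypothetical, and the cufftPlanMany call in the comment is only an assumed correction based on the advice above, not code taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH  512
#define HEIGHT 512

/* Hypothetical helper mirroring the 1D formula: index = b * dist + x * stride. */
static size_t element_index(size_t b, size_t x, size_t stride, size_t dist)
{
    assert(stride > 0 && dist > 0);
    return b * dist + x * stride;
}

/* Row-major storage: element (row, col) lives at row * WIDTH + col. */
static size_t row_major(size_t row, size_t col)
{
    return row * WIDTH + col;
}

/* X pass: batch b is row b, so stride = 1 and dist = WIDTH. */
static size_t x_pass_index(size_t b, size_t x)
{
    return element_index(b, x, 1, WIDTH);
}

/* Y pass: batch b is column b, so stride = WIDTH and dist = 1.
 * Assumed corrected plan (a sketch, not from the original post):
 *   int n[1] = { HEIGHT };
 *   cufftPlanMany(&plany, 1, n, n, WIDTH, 1, n, WIDTH, 1, CUFFT_C2C, WIDTH);
 * Passing n as inembed/onembed just makes the pointers non-NULL. */
static size_t y_pass_index(size_t b, size_t x)
{
    return element_index(b, x, WIDTH, 1);
}

/* With NULL inembed/onembed the stride and dist arguments are ignored:
 * the layout is contiguous, i.e. stride = 1 and dist = n. */
static size_t null_embed_index(size_t b, size_t x, size_t n)
{
    return element_index(b, x, 1, n);
}

/* Returns 1 if both passes address the row-major matrix as intended. */
static int layouts_are_consistent(void)
{
    for (size_t b = 0; b < HEIGHT; b++)       /* X pass: one batch per row */
        for (size_t x = 0; x < WIDTH; x++)
            if (x_pass_index(b, x) != row_major(b, x)) return 0;
    for (size_t b = 0; b < WIDTH; b++)        /* Y pass: one batch per column */
        for (size_t x = 0; x < HEIGHT; x++)
            if (y_pass_index(b, x) != row_major(x, b)) return 0;
    return 1;
}
```

For the 512x512 case, null_embed_index(b, x, HEIGHT) gives the same index as x_pass_index(b, x). In other words, the original plany transformed rows a second time instead of columns.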

I found this out this morning. I needed to pass inembed = onembed = WIDTH*HEIGHT to get the right output, and then the FFT in Y takes about 30% longer than the 2D FFT. I'm now looking for a different trick: I need to perform a very sparse FFT and want a method that runs faster than the dense FFT. I'd be happy for any pointers in that direction …

The CUFFT documentation is very terse on the inembed and onembed values, so it took some time to figure them out.