cufft only gives non-redundant results


I’m doing a simple 2D real to complex FFT but the cufftExecR2C only gives half the results called the non-redundant complex coefficients? But I want the full results including the redundant ones. Could anyone shine me some light on this pls? Thanks for any help in advance.


I don’t why you need the other coefficients (my algorithms works well with only the non redundant coefficients)

But I implemented a kernel that makes what you want (just curiosity to compare with the matlab results), so why not share :)

w = width of the initial (real) data

h = height

src is the array returned by cufft (the width is (w/2 + 1) and the height is h )

dst is the new array you want to compute (width is w and height is h) with all the coefficients

notice that with this new array you won’t be able to call a C2R transform in cufft( cufft’s C2R transform needs the array with only the non redundant coefficients)

it’s a “bullet-proof” kernel, so you can tune the code a little more if you know the call configuration in advance, especially the if conditions that check everything

this kernel will need “at least w/2 + 1 threads along X” times “at least h threads along Y” with any block/grid configuration, with the usual recommendations

__global__ void k_makeRedundant(float2* dst, const float2* src, int w, int h) {

	volatile int gid_x = threadIdx.x + blockIdx.x * blockDim.x;

	volatile int gid_y = threadIdx.y + blockIdx.y * blockDim.y;

	volatile int nbNoRedundants = (w >> 1) + 1;

	// index for reading :

	volatile int gid = gid_x + nbNoRedundants * gid_y;

	float2 val;

	if(gid_x < nbNoRedundants && gid_y < h) {

		// write the non redundant part in the new array :

		val = src[gid];

		gid = gid_x + w * gid_y; // new index for writing

		dst[gid] = val;


	// shift'n'flip

	gid_x = w - gid_x;

	if(gid_y != 0)

		gid_y = h - gid_y;

	gid = gid_x + w * gid_y;

	// write conjugate :

	if(gid_x >= nbNoRedundants && gid_x < w && gid_y >= 0 && gid_y < h) {

		val.y = -val.y;

		dst[gid] = val; // never coalesced with compute <= 1.1 ; coalesced if >= 1.2 AND w multiple of 16 AND good call configuration



Oh, and I may be able to find on my hard drive a specific version I wrote for compute 1.1 devices. Not 100% coalesced (w/2 + 1 and w are not good at being multiples of 16 at the same time :D), but much better in most cases, concerning the final write (handle the reversing of the index along X with shared memory and perform a coalesced write)

Thank you so much Cuda Libre. I need the rest of the coefficients as I want to draw the 2D FFT power spectrum. Please correct me if I’m doing something wrong.

Hi, thank you for sharing your code. I used your method but got all zeros. I checked my original code which only returns non-redundant values worked fine. Could you help me with where I might have done wrong? Thanks a lot!

float2 *FFTR2C(float *array1D, int NX, int NY)


// Pointer for host memory

float2 *odata;

// Pointer for device memory

cufftHandle plan;

cufftReal *data1;

cufftComplex *data2, *data3;

// Calculate memory needed

unsigned int memSizeIn=sizeof(cufftReal)*NX*NY;

    unsigned int memSizeOut=sizeof(cufftComplex)*NX*NY;

// Define grid and block size

    dim3 gridSize(NX/BLOCK_X, NY/BLOCK_Y);

    dim3 blockSize(NX/2+1,NY);

// Allocate memory on the CPU side


// Allocate memory on the GPU side

cudaMalloc((void**)&data1, memSizeIn);

cudaMalloc((void**)&data2, memSizeOut);

cudaMalloc((void**)&data3, memSizeOut);


// 2-D Real to Complex Non-redundant FFT


// Copy from host to device

cudaMemcpy(data1, array1D, memSizeIn, cudaMemcpyHostToDevice);

// Create a 2D FFT plan

cufftPlan2d(&plan, NX, NY, CUFFT_R2C);

// Use the CUFFT plan to transform the signal out of place

cufftExecR2C(plan, data1, data2);


// Recover Redundant Complex Coefficients


// Launch kernel

RecoverRedundant<<<gridSize, blockSize>>>(data3, data2, NX, NY); 

// Copy from device to host

    cudaMemcpy(odata, data3, NX*NY*sizeof(float2), cudaMemcpyDeviceToHost);

// Check for any CUDA errors


// Destroy the CUFFT plan and free memory





for (int i=0; i<100; i++)


	printf("FFT complex x=%f y=%d\n", odata[i].x, odata[i].y);


return odata;



I think you’re wrong when writing :

dim3 gridSize(NX/BLOCK_X, NY/BLOCK_Y);

dim3 blockSize(NX/2+1,NY);

Here you’re trying to launch [ (NX/BLOCK_X) * (NY/BLOCK_Y) * (NX/2 + 1) * NY ] threads, don’t you think it is not a correct value ? (since there is NX * NY pixels)

Try to set appropriate sizes. Use functiosn like these :

dim3 makeGridSize(int nx, int ny, int nz, const dim3& blockDim) {

        int gx = (nx + blockDim.x - 1) / blockDim.x,

            gy = (ny + blockDim.y - 1) / blockDim.y,

            gz = (nz + blockDim.z - 1) / blockDim.z;

return dim3(gx, gy, gz);


then use :

dim3 blockSize(16, 16, 1); // quite standard, but you may want to do finer tuning depending on your algorithm and your device

dim3 gridSize = makeGridSize(NX/2 + 1, NY, 1, blockSize); // build the appropriate gridSize in function of blockSize

Don’t hesitate if it doesn’t work

Thanks a lot. I should have noticed I made a silly mistake there. The program is now working alright. May I ask if there’s any benefit to use 3-dimensional blocks? I’ve seen many people suggest only use 2-dimensional as there can be complicated problems with 3-dimensional blocks. Thank you again for your kind help.