I’m doing a simple 2D real to complex FFT but the cufftExecR2C only gives half the results called the non-redundant complex coefficients? But I want the full results including the redundant ones. Could anyone shine me some light on this pls? Thanks for any help in advance.

I don’t why you need the other coefficients (my algorithms works well with only the non redundant coefficients)

But I implemented a kernel that makes what you want (just curiosity to compare with the matlab results), so why not share :)

w = width of the initial (real) data

h = height

src is the array returned by cufft (the width is (w/2 + 1) and the height is h )

dst is the new array you want to compute (width is w and height is h) with all the coefficients

notice that with this new array you won’t be able to call a C2R transform in cufft( cufft’s C2R transform needs the array with only the non redundant coefficients)

it’s a “bullet-proof” kernel, so you can tune the code a little more if you know the call configuration in advance, especially the if conditions that check everything

this kernel will need “at least w/2 + 1 threads along X” times “at least h threads along Y” with any block/grid configuration, with the usual recommendations

__global__ void k_makeRedundant(float2* dst, const float2* src, int w, int h) {
volatile int gid_x = threadIdx.x + blockIdx.x * blockDim.x;
volatile int gid_y = threadIdx.y + blockIdx.y * blockDim.y;
volatile int nbNoRedundants = (w >> 1) + 1;
// index for reading :
volatile int gid = gid_x + nbNoRedundants * gid_y;
float2 val;
if(gid_x < nbNoRedundants && gid_y < h) {
// write the non redundant part in the new array :
val = src[gid];
gid = gid_x + w * gid_y; // new index for writing
dst[gid] = val;
}
// shift'n'flip
gid_x = w - gid_x;
if(gid_y != 0)
gid_y = h - gid_y;
gid = gid_x + w * gid_y;
// write conjugate :
if(gid_x >= nbNoRedundants && gid_x < w && gid_y >= 0 && gid_y < h) {
val.y = -val.y;
dst[gid] = val; // never coalesced with compute <= 1.1 ; coalesced if >= 1.2 AND w multiple of 16 AND good call configuration
}
}

Oh, and I may be able to find on my hard drive a specific version I wrote for compute 1.1 devices. Not 100% coalesced (w/2 + 1 and w are not good at being multiples of 16 at the same time :D), but much better in most cases, concerning the final write (handle the reversing of the index along X with shared memory and perform a coalesced write)

Thank you so much Cuda Libre. I need the rest of the coefficients as I want to draw the 2D FFT power spectrum. Please correct me if I’m doing something wrong.

Hi, thank you for sharing your code. I used your method but got all zeros. I checked my original code which only returns non-redundant values worked fine. Could you help me with where I might have done wrong? Thanks a lot!

float2 *FFTR2C(float *array1D, int NX, int NY)

{

// Pointer for host memory
float2 *odata;
// Pointer for device memory
cufftHandle plan;
cufftReal *data1;
cufftComplex *data2, *data3;
// Calculate memory needed
unsigned int memSizeIn=sizeof(cufftReal)*NX*NY;
unsigned int memSizeOut=sizeof(cufftComplex)*NX*NY;
// Define grid and block size
dim3 gridSize(NX/BLOCK_X, NY/BLOCK_Y);
dim3 blockSize(NX/2+1,NY);
// Allocate memory on the CPU side
odata=(float2*)malloc(NX*NY*sizeof(float2));
// Allocate memory on the GPU side
cudaMalloc((void**)&data1, memSizeIn);
cudaMalloc((void**)&data2, memSizeOut);
cudaMalloc((void**)&data3, memSizeOut);
////////////////////////////////////////////////////////////////////////////////
// 2-D Real to Complex Non-redundant FFT
////////////////////////////////////////////////////////////////////////////////
// Copy from host to device
cudaMemcpy(data1, array1D, memSizeIn, cudaMemcpyHostToDevice);
// Create a 2D FFT plan
cufftPlan2d(&plan, NX, NY, CUFFT_R2C);
// Use the CUFFT plan to transform the signal out of place
cufftExecR2C(plan, data1, data2);
////////////////////////////////////////////////////////////////////////////////
// Recover Redundant Complex Coefficients
////////////////////////////////////////////////////////////////////////////////
// Launch kernel
RecoverRedundant<<<gridSize, blockSize>>>(data3, data2, NX, NY);

// Copy from device to host

cudaMemcpy(odata, data3, NX*NY*sizeof(float2), cudaMemcpyDeviceToHost);
// Check for any CUDA errors
cudaThreadSynchronize();
// Destroy the CUFFT plan and free memory
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
cudaFree(data3);
for (int i=0; i<100; i++)
{
printf("FFT complex x=%f y=%d\n", odata[i].x, odata[i].y);
}
return odata;

Here you’re trying to launch [ (NX/BLOCK_X) * (NY/BLOCK_Y) * (NX/2 + 1) * NY ] threads, don’t you think it is not a correct value ? (since there is NX * NY pixels)

Try to set appropriate sizes. Use functiosn like these :

dim3 blockSize(16, 16, 1); // quite standard, but you may want to do finer tuning depending on your algorithm and your device
dim3 gridSize = makeGridSize(NX/2 + 1, NY, 1, blockSize); // build the appropriate gridSize in function of blockSize

Thanks a lot. I should have noticed I made a silly mistake there. The program is now working alright. May I ask if there’s any benefit to use 3-dimensional blocks? I’ve seen many people suggest only use 2-dimensional as there can be complicated problems with 3-dimensional blocks. Thank you again for your kind help.