The simplest real-part Fourier transform

AlexIanevski · March 7, 2015, 8:32pm

I have a problem for I am not able to rewrite this c function in CUDA C.

void MYFT(float * inreal, float * outreal, const int n) 
{
    for (int k = 0; k < n; k++) { 
        float sumreal = 0.0f;
        for (int t = 0; t < n; t++) { 
            float angle = 2 * 3.14159265359 * t * k / n;
            sumreal += inreal[t] * cos(angle);
        }
        outreal[k] = sumreal;
    }
}

I have started like that but have some problems since I don’t know how to finish that in order to get the best performance. I would really appreciate any help. Thanks a lot.

#define N (1 << 20)
#define BLOCK 256

MYFT<<<N/BLOCK, BLOCK>>>(d_Fx_r, d_FX_r , N);

__global__
void MYFT(const float * inreal, float * outreal, const int n)
{
    int _x   = threadIdx.x + blockIdx.x * blockDim.x;
    int _TId = threadIdx.x;

    __shared__ float s_real[BLOCK];

    s_real[_TId] = inreal[_x];

    __syncthreads();

    for(int k = _x; k < n; k +=  blockDim.x * gridDim.x){
        float sumreal = 0.0f;

Robert_Crovella · March 7, 2015, 10:45pm

cross posted here:

[url]http://stackoverflow.com/questions/28799428/cuda-using-shared-memory-for-the-simplest-real-part-fourier-transform-implementa[/url]

AlexIanevski · March 7, 2015, 11:37pm

There were no answers. Well, probably it’s better to do it yourself than than ask someone for.

CudaaduC · March 8, 2015, 9:14pm

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

CudaaduC · March 8, 2015, 9:14pm

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

CudaaduC · March 8, 2015, 9:14pm

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

AlexIanevski · March 11, 2015, 9:39pm

have not seen your answer. thanks, yeah, it’s for studing purposes and I just want to implement this. without using CUFFT, which is actually serves as an example to compare my results while implementing.

That’s a good idea and kind of a thing what I am trying to do.

Topic		Replies	Views
CUFFT CUFFT Library CUDA Programming and Performance	2	3759	October 18, 2007
Real to complex FFTs using the cufftdx library GPU-Accelerated Libraries cufft	2	514	September 8, 2023
How to implement shared memory in kernel CUDA Programming and Performance	2	1008	May 4, 2011
FFT of two real samples using one C2C of cuFFTDx GPU-Accelerated Libraries cufft	1	601	September 8, 2023
CUDA FFT of reals in 2D CUDA Programming and Performance	5	15061	June 12, 2007
What is the real memory usage of cudaFFT CUDA Programming and Performance	5	16836	January 21, 2008
in-place FFT fails on 2D real transform CUDA Programming and Performance	0	849	August 13, 2009
2D FFT- Real to Complex and Complex to Real CUDA Programming and Performance	0	2210	February 3, 2009
Newbie to cuFFT - how to do real-to-real transforms GPU-Accelerated Libraries	5	2162	February 19, 2019
Real to real transform with CUFFT? CUDA Programming and Performance	2	5797	June 28, 2008

The simplest real-part Fourier transform

Related topics