The simplest real-part Fourier transform

I have a problem for I am not able to rewrite this c function in CUDA C.

void MYFT(float * inreal, float * outreal, const int n) 
{
    for (int k = 0; k < n; k++) { 
        float sumreal = 0.0f;
        for (int t = 0; t < n; t++) { 
            float angle = 2 * 3.14159265359 * t * k / n;
            sumreal += inreal[t] * cos(angle);
        }
        outreal[k] = sumreal;
    }
}

I have started like that but have some problems since I don’t know how to finish that in order to get the best performance. I would really appreciate any help. Thanks a lot.

#define N (1 << 20)
#define BLOCK 256

MYFT<<<N/BLOCK, BLOCK>>>(d_Fx_r, d_FX_r , N);

__global__
void MYFT(const float * inreal, float * outreal, const int n)
{
    int _x   = threadIdx.x + blockIdx.x * blockDim.x;
    int _TId = threadIdx.x;

    __shared__ float s_real[BLOCK];

    s_real[_TId] = inreal[_x];

    __syncthreads();

    for(int k = _x; k < n; k +=  blockDim.x * gridDim.x){
        float sumreal = 0.0f;

cross posted here:

http://stackoverflow.com/questions/28799428/cuda-using-shared-memory-for-the-simplest-real-part-fourier-transform-implementa

There were no answers. Well, probably it’s better to do it yourself than than ask someone for.

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

If I understand correctly, you are attempting to implement a simple DFT with only the real component, of which the Big-O complexity is N^2.

With your example N of (1<<20) with N^2 of 1.0995116e+12.

Obviously using the cuFFT implementation which is N*log(N) will be much faster, but it seems your intent is to learn how to implement a simple DFT (real component only) in CUDA.

In that case, for learning purposes, it would be a better idea to do experiment with small DFTs, where N is some small easy number to work with, like the number of threads in a block(for example 256).

The simple general learning approach would then to have each thread be responsible for a single input and output value. Since you need to go through all the n values per output value, each thread would have to iterate and accumulate its real sum ( your loop ‘t’) then write out to global memory the answer.

Because you would not want all threads in a block to be repetitively reading the same values from global memory, each thread should first load into shared memory an adjacent input value from global memory(or values if the number of threads in a block is less than ‘n’), __syncthreads(), then do the inner ‘t’ loop and write the sum answer out to global memory in hopefully a coalesced pattern. You may still get bank conflicts, but would be preferable to using global memory.

I am just guessing at what you are attempting to do, but the approach I mentioned(assuming correctly implemented) would only be advisable in situations where you have a large number of small DFTs which could be covered by different thread blocks without sharing global memory values between those thread blocks.

For non-learning purposes I would recommend cuFFT, which will be faster in most cases.

have not seen your answer. thanks, yeah, it’s for studing purposes and I just want to implement this. without using CUFFT, which is actually serves as an example to compare my results while implementing.

That’s a good idea and kind of a thing what I am trying to do.