Hello,

First off, let me apologize for re-posting this question from another part of the forum (Accelerated Library). I am not entirely sure if this falls under CUDA Programming and Performance or the Accelerated Library, so if I have placed this question in the wrong section please forgive me.

I would like to build a Discrete Cosine Transform (type 2) using cuFFT on an NVIDIA Tesla V100 DEVICE, using the complex-to-complex forward FFT. I am fairly new to cuFFT.

I am very early in the design phase of my DCT type 2 function, but briefly, the algorithm is as follows (please be advised it is really ROUGH at this stage and assumes all arrays are of cuComplex type):

- Mirror the input array [length: N] into a temp array [length: 2N]
- Call the forward FFT on the temp array
- Perform post-processing on the temp array, storing the result in an output array [length: N]

…

for (int i = 0; i < N; ++i) {
    float val = PI * i / (2 * N);
    out[i].x = cos(val) * temp[i].x; // real part
    out[i].y = sin(val) * temp[i].y; // imaginary part
}

…
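
To make the three steps concrete, here is a minimal host-side C++ sketch that could serve as a reference when checking the kernels. It assumes the unnormalized definition X[k] = sum over n of x[n] * cos(pi * (2n + 1) * k / (2N)), for which the 2N-point FFT Y of the mirrored array satisfies X[k] = 0.5 * exp(-i * pi * k / (2N)) * Y[k]. The names cf, mirror, forward_dft, dct2_via_fft and dct2_direct are made up for illustration, std::complex<float> stands in for cuComplex, and a naive O(M^2) DFT stands in for the cuFFT forward call.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;  // hypothetical host-side stand-in for cuComplex

// Step 1: mirror x[0..N-1] into y[0..2N-1] = [x0, ..., x_{N-1}, x_{N-1}, ..., x0].
std::vector<cf> mirror(const std::vector<cf>& x) {
    const std::size_t n = x.size();
    std::vector<cf> y(2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = x[i];
        y[2 * n - 1 - i] = x[i];
    }
    return y;
}

// Step 2: naive forward DFT (assumption: placeholder for the cuFFT C2C forward transform).
std::vector<cf> forward_dft(const std::vector<cf>& y) {
    const std::size_t m = y.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cf> out(m);
    for (std::size_t k = 0; k < m; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < m; ++j) {
            const double ang = -two_pi * double((j * k) % m) / double(m);
            acc += std::complex<double>(y[j]) * std::polar(1.0, ang);
        }
        out[k] = cf(float(acc.real()), float(acc.imag()));
    }
    return out;
}

// Step 3: X[k] = 0.5 * exp(-i*pi*k/(2N)) * Y[k] for k = 0..N-1.
std::vector<cf> dct2_via_fft(const std::vector<cf>& x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    const std::vector<cf> Y = forward_dft(mirror(x));
    const float pi = std::acos(-1.0f);
    std::vector<cf> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        const float theta = pi * float(k) / float(2 * n);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        // Complex multiply by (c - i*s): each output mixes Re(Y[k]) and Im(Y[k]).
        X[k] = cf(0.5f * (c * Y[k].real() + s * Y[k].imag()),
                  0.5f * (c * Y[k].imag() - s * Y[k].real()));
    }
    return X;
}

// Direct O(N^2) evaluation of the assumed DCT-II definition, for checking only.
std::vector<cf> dct2_direct(const std::vector<cf>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cf> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double w = std::cos(pi * double(2 * j + 1) * double(k) / double(2 * n));
            acc += std::complex<double>(x[j]) * w;
        }
        X[k] = cf(float(acc.real()), float(acc.imag()));
    }
    return X;
}
```

Under this definition each output combines the real and imaginary parts of the FFT bin, and for real-valued input the imaginary part of X[k] comes out as zero up to rounding. A different DCT-II normalization would only change the 0.5 scale factor.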

All of the above steps would be implemented as kernels, with DEVICE memory allocated once up front and reused throughout, so that only pointers to DEVICE-allocated arrays are passed to each kernel. However, I am not sure whether this is the optimal approach.

Any general ideas/hints would be greatly appreciated.

Thank you.