cufftPlanMany parameter setting

Hi All, I am new to this library (and CUDA). I am trying to use cufftPlanMany() to perform the following computation and do not know how to set the parameters of cufftPlanMany() correctly. Reading the library manual did not really help; I think Nvidia should have included some diagrams to illustrate what these parameters mean. Here’s what I’m trying to do:

I have a vector of sample values (Real), say of length N, where N is a power of 2. I want to divide this vector into segments of length W, also a power of two. So we can say that N = M*W, where M is the number of segments. Now I want to use cufftPlanMany() to compute the 1D FFT of each segment, so there will be M W-Point 1D FFTs. Then I want to average those M FFTs to produce the desired result. How do I set the parameters to do this? I think the averaging may have to be done subsequently after I get the FFTs.

I don’t really know HOW cufftPlanMany() does what it does, which would have helped me understand the meaning of these parameters.

thanks a million.

The API is documented, and there are 3 code examples in the cufft documentation that indicate how to use cufftPlanMany() in 3 different scenarios.

Perhaps you are getting tripped up on the advanced data layout parameters. These can be essentially disregarded if you have a relatively simple scenario where the data for each signal is in a single group, and the groups are adjacent in memory. We can also considerably simplify the situation for demonstration purposes by performing a C2C transform instead of an R2C transform. And as you point out, the averaging would not be done by CUFFT but by your own supplied kernel. For that you might want to take a look at the new cufft callback feature, but I would start just by getting something basic working.

To that end, here is a sample code that I put together that should hopefully help with understanding cufftPlanMany.

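#include <cufft.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define N_SIGS 32     // number of signals, i.e. the batch count
#define SIG_LEN 1024  // points per signal, i.e. the length of each FFT

int main(){

  cuFloatComplex *h_signal, *d_signal, *h_result, *d_result;

  h_signal = (cuFloatComplex *)malloc(N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
  h_result = (cuFloatComplex *)malloc(N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
  // signal i is a sine with i+1 cycles; the signals are stored back to back
  for (int i = 0; i < N_SIGS; i ++)
    for (int j = 0; j < SIG_LEN; j++)
      h_signal[(i*SIG_LEN) + j] = make_cuFloatComplex(sin((i+1)*6.283*j/SIG_LEN), 0);
  cudaMalloc(&d_signal, N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
  cudaMalloc(&d_result, N_SIGS*SIG_LEN*sizeof(cuFloatComplex));

  cudaMemcpy(d_signal, h_signal, N_SIGS*SIG_LEN*sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
  cufftHandle plan;
  int n[1] = {SIG_LEN};  // size of each transform, one entry per dimension

  // rank = 1 (1D transforms), batch = N_SIGS transforms of SIG_LEN points each
  cufftResult res = cufftPlanMany(&plan, 1, n,
     NULL, 1, SIG_LEN,  //inembed, istride, idist: NULL shuts off advanced data layout
     NULL, 1, SIG_LEN,  //onembed, ostride, odist: NULL shuts off advanced data layout
     CUFFT_C2C, N_SIGS);
  if (res != CUFFT_SUCCESS) {printf("plan create fail\n"); return 1;}

  // one call executes all N_SIGS transforms
  res = cufftExecC2C(plan, d_signal, d_result, CUFFT_FORWARD);
  if (res != CUFFT_SUCCESS) {printf("forward transform fail\n"); return 1;}
  cudaMemcpy(h_result, d_result, N_SIGS*SIG_LEN*sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);

  // print the real part of the first 10 bins of each transform; for a sine input
  // the peak at bin i+1 is mostly in the imaginary part (see cuCimagf, cuCabsf)
  for (int i = 0; i < N_SIGS; i++){
    for (int j = 0; j < 10; j++)
      printf("%.3f ", cuCrealf(h_result[(i*SIG_LEN)+j]));
    printf("\n"); }

  // release the plan and the buffers
  cufftDestroy(plan);
  cudaFree(d_signal); cudaFree(d_result);
  free(h_signal); free(h_result);
  return 0;
}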
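
For the averaging step, which cuFFT does not do for you, here is a rough host-side sketch in plain C. It assumes the M spectra sit back to back in memory, as in the result buffer of a contiguous batched plan, and that a plain complex average per bin is what you want (if you are after an averaged power spectrum you would average squared magnitudes instead). The cplx struct and the name average_spectra are made up for this illustration so the sketch builds without CUDA; a device kernel would use the same indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuFloatComplex so this builds without CUDA. */
typedef struct { float re, im; } cplx;

/* Average m spectra of w bins each into avg (w bins).
   Assumes spectrum i occupies elements i*w .. i*w + w - 1 of spectra. */
void average_spectra(const cplx *spectra, int m, int w, cplx *avg)
{
    assert(m > 0 && w > 0);
    for (int j = 0; j < w; j++) {          /* one output bin at a time */
        float re = 0.0f, im = 0.0f;
        for (int i = 0; i < m; i++) {      /* sum bin j over all spectra */
            re += spectra[(size_t)i * w + j].re;
            im += spectra[(size_t)i * w + j].im;
        }
        avg[j].re = re / m;
        avg[j].im = im / m;
    }
}
```

On the GPU the outer loop over j maps naturally to one thread per bin, with each thread running the inner loop over the M spectra.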

Well, thank you for reading and replying, and for the code. I was actually trying to understand the advanced layout parameters (even if I don’t need them for the case I outlined in my initial post). As I pointed out, the documentation did not really help me with that. It should be really simple to explain what the parameters mean… that’s not what I found in the cuFFT manual. A Google search on the names of these parameters (nembed in particular) returns multiple hits of people asking the same question (and, painfully, never getting a fully satisfying answer).

inembed and onembed are actually lifted from FFTW behavior. Except for some nuances around API behavior when the first parameter is NULL (which shuts off ADL in cuFFT, but may have slightly different behavior in FFTW depending on the specifics of the transform), the behavior and definitions of inembed and onembed should be the same between FFTW and cuFFT.

[url]c++ - FFTW advanced layout -- inembed=n and inembed=NULL give different results? - Stack Overflow

Have you read the ADL (advanced data layout) section?

[url]cuFFT :: CUDA Toolkit Documentation

It may be somewhat dense, but all the arithmetic seems to be there to define data layout mapping. The basic definitions are:

"The idist and odist parameters indicate the distance between the first element of two consecutive batches in the input and output data. "

"The inembed and onembed parameters define the number of elements in each dimension in the input array and the output array respectively." Since the transforms may be multi-dimensional, inembed and onembed are arrays with one entry per transform dimension (rank entries each).

“The istride and ostride parameters denote the distance between two successive input and output elements in the least significant (that is, the innermost) dimension respectively.”

ADL would be used in specific scenarios where the data layout is not simply an arrangement of contiguous FFTs. For example, if my data sets were interleaved, then ADL would be useful.
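
To make the definitions concrete, here is a small plain-C sketch of the index arithmetic as I read it from the ADL section: for a 1D transform, element x of batch b sits at b * idist + x * istride, and for a 2D transform at b * idist + (x * inembed[1] + y) * istride. The function names are made up for illustration and nothing here calls cuFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element x of batch b for a 1D transform:
   b * dist + x * stride (same form for input and output). */
size_t adl_offset_1d(size_t b, size_t x, size_t stride, size_t dist)
{
    assert(stride > 0);
    return b * dist + x * stride;
}

/* Offset of element (x, y) of batch b for a 2D transform:
   b * dist + (x * nembed1 + y) * stride, where nembed1 is
   inembed[1] (or onembed[1]), the storage width of the inner dimension. */
size_t adl_offset_2d(size_t b, size_t x, size_t y,
                     size_t nembed1, size_t stride, size_t dist)
{
    assert(stride > 0);
    return b * dist + (x * nembed1 + y) * stride;
}
```

With 32 signals of 1024 points stored back to back, stride = 1 and dist = 1024, so element 3 of signal 2 is at offset 2051. If the same signals were interleaved (sample 0 of every signal, then sample 1 of every signal, and so on), stride = 32 and dist = 1, and that element is at offset 98. Note that for 1D transforms only the stride and dist matter in this formula.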

Here’s a worked example of cufftPlanMany with advanced data layout with interleaved data sets:

[url]cuda - the results of fftw and cufft are different - Stack Overflow

Note that in the example you provided, ADL should not be necessary, as I have indicated.
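
For the original question (real input of length N split into M segments of W points), a real-to-complex plan would differ from the C2C sample mainly in the output size, since an R2C transform of W real points produces W/2 + 1 complex values. Below is a hedged plain-C sketch of that bookkeeping; the struct and function names are hypothetical, and the cuFFT calls appear only in a comment as an untested suggestion.

```c
#include <assert.h>

/* Hypothetical helper type: sizes for a batch of 1D R2C transforms. */
typedef struct {
    int batch;  /* M, number of segments */
    int n;      /* W, points per segment */
    int idist;  /* real elements between the starts of consecutive segments */
    int odist;  /* complex elements between the starts of consecutive spectra */
} r2c_layout;

/* Assumes total_len is an exact multiple of seg_len and the segments
   are contiguous and non-overlapping. */
r2c_layout r2c_segment_layout(int total_len, int seg_len)
{
    r2c_layout l;
    assert(seg_len > 0 && total_len % seg_len == 0);
    l.batch = total_len / seg_len;
    l.n     = seg_len;
    l.idist = seg_len;
    l.odist = seg_len / 2 + 1;  /* R2C keeps only the non-redundant half */
    return l;
}

/* Possible use with cuFFT (sketch only, not compiled here):
     cufftPlanMany(&plan, 1, &l.n, NULL, 1, l.idist,
                   NULL, 1, l.odist, CUFFT_R2C, l.batch);
     cufftExecR2C(plan, d_in, d_out);
   with d_out sized for l.batch * l.odist complex values. */
```

For example, N = 4096 and W = 1024 gives 4 segments and 513 complex outputs per segment.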