cuFFT 1D row by row using cufftPlanMany

Hi, I’m experimenting with implementing some basic DSP filtering with CUDA. I finished my 1D direct FFT filter and am now trying to filter a 2D matrix row by row, but faster than just running the 1D filter on each row sequentially.

From what I’ve read, this is usually done with cufftPlanMany rather than cufftPlan1d with batches, but I’m struggling to figure out how to set the length of my FFT properly.

As I’m doing DSP filtering, I want to take an FFT of my impulse response (filter) and of my signal, with the FFT length chosen as the next power of 2 above (signalLength+irLength-1). In 1D, cufftPlan1d let me set the size of the FFT with the ‘nx’ argument.

cufftPlan1d(&plan, fftLength, CUFFT_R2C, 1);
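
For reference, the length rule described above could be sketched as a small host-side helper. The function names nextPow2 and fftLengthFor are hypothetical, and the sketch assumes that a length already equal to a power of two is acceptable as is (that is, "greater than or equal" rather than strictly greater):

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= n (n must be at least 1).
// Assumption: an exact power of two is returned unchanged.
std::size_t nextPow2(std::size_t n) {
    assert(n >= 1);
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// FFT length for linearly convolving a signal with an impulse response:
// the full convolution has signalLength + irLength - 1 samples.
std::size_t fftLengthFor(std::size_t signalLength, std::size_t irLength) {
    assert(signalLength >= 1 && irLength >= 1);
    return nextPow2(signalLength + irLength - 1);
}
```

The result would then be passed as the transform size, for example as the ‘nx’ argument of cufftPlan1d.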

But now that in 2D my signal matrix has size signalLength*rows, where can I tell cuFFT to pad each input row so that it reaches my chosen FFT length?

Because if I use:

cufftPlanMany(&plan, 1, {fftLength}, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, rows)

How will it know the length of each signal (row)? I’m guessing I would need to set the ‘idist’ argument for that, but I’m struggling to figure out what all the other arguments should be. My best guesses would be:

int rank = 1;
int n[] = {fftLength};
int inembed[] = {0};
int istride = 1;
int idist = inputLength;
int onembed[] = {0};
int ostride = 1;
int odist = (fftLength/2) + 1;
int batch = rows;
cufftPlanMany(&forwardPlanInput, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch);

But I very much doubt this is correct, as the results are only right for the first row.
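
One way to reason about these parameters is cuFFT's documented addressing rule for 1D advanced layouts, input[b * idist + x * istride], where b is the batch index and x the sample index. The sketch below only evaluates that rule on the host; the helper names inputOffset and readsIntoNextBatch are hypothetical. With a transform length of 16 and idist = 10 (the test values used later in this thread), each transform would read past the start of the next row, which may be one reason the results come out wrong:

```cpp
#include <cassert>
#include <cstddef>

// Element offset of sample x in batch b for a 1D advanced-layout transform,
// following the documented rule input[b * idist + x * istride].
std::size_t inputOffset(std::size_t b, std::size_t x,
                        std::size_t istride, std::size_t idist) {
    return b * idist + x * istride;
}

// For unit stride only: reports whether a transform of length n that starts
// at batch 0 reads elements at or beyond the start of batch 1.
bool readsIntoNextBatch(std::size_t n, std::size_t idist) {
    assert(n >= 1);
    const std::size_t lastOfBatch0 = inputOffset(0, n - 1, 1, idist);
    const std::size_t firstOfBatch1 = inputOffset(1, 0, 1, idist);
    return lastOfBatch0 >= firstOfBatch1;
}
```

Under this rule, rows only stay separate when idist is at least the transform length, so the data has to be padded (or spaced out) before the plan is executed.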

I made some progress. It seems cufftPlanMany can’t do the padding itself, so I’m now doing that in a separate step using cudaMemset2D. However, I’m still stuck on doing row-by-row 1D FFTs of the input. I saw some examples that also worked with pitched input, but those all performed 2D FFTs, not 1D. If I actually perform a 2D FFT it works fine; it’s just the 1D case that isn’t working.

This is my current code:

float *input_d, *input_zp_d;
cufftComplex *input_c_d;

// Input for testing
int rows = 3;
int inputLength = 10;
int fftLength = 16;
float* input = new float[inputLength*rows];
for (int i = 0; i < rows; i ++){
	for (int j = 0; j < inputLength; j ++){
		input[i*inputLength + j] = i*inputLength + j; 
	}
}

// allocate memory
size_t pitch_input;
size_t pitch_input_zp;
size_t pitch_input_c;
cudaErrchk(cudaMallocPitch(&input_d,&pitch_input, inputLength*sizeof(float),rows));
cudaErrchk(cudaMallocPitch(&input_zp_d,&pitch_input_zp, fftLength*sizeof(float),rows));
cudaErrchk(cudaMallocPitch(&input_c_d,&pitch_input_c, ((fftLength/2)+1)*sizeof(cufftComplex),rows));

// Prepare zero padded memory for input matrix
cudaErrchk(cudaMemset2D(input_zp_d, pitch_input_zp, 0, fftLength*sizeof(float), rows));

// Copy input matrix to zero padded memory
cudaErrchk(cudaMemcpy2D(input_zp_d, pitch_input_zp, input, inputLength*sizeof(float), inputLength*sizeof(float), rows, cudaMemcpyHostToDevice));

// Create CuFFT plans
cufftHandle forwardPlanInput;
int rank = 1;
int n[] = {fftLength};
int inembed[] = {static_cast<int>(pitch_input_zp/sizeof(float))};
int istride = 1;
int idist = fftLength;
int onembed[] = {static_cast<int>(pitch_input_c/sizeof(cufftComplex))};
int ostride = 1;
int odist = ((fftLength/2)+1);
int batch = rows;
cuFFTErrchk(cufftPlanMany(&forwardPlanInput, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch));

// Calculate FFT of input arrays
cuFFTErrchk(cufftExecR2C(forwardPlanInput,input_zp_d, input_c_d));
cudaDeviceSynchronize();
cufftComplex* testOutput = new cufftComplex[((fftLength/2)+1)*rows];
cudaErrchk(cudaMemcpy2D(testOutput,((fftLength/2)+1)*sizeof(cufftComplex), input_c_d,pitch_input_c, ((fftLength/2)+1)*sizeof(cufftComplex),rows, cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();
std::cout << "Signal FFT:"  << std::endl; 
for (int i = 0; i < rows; i ++){
	for (int j = 0; j < ((fftLength/2)+1); j ++){
		std::cout << testOutput[i*((fftLength/2)+1) + j].x << "+" << testOutput[i*((fftLength/2)+1) + j].y << "j ";
	}
	std::cout << std::endl; 
}

Anyone have any ideas?

I have the same problem as you.
In my opinion, you should change the following cufftPlanMany parameters to:

int inembed[] = {fftLength};
int onembed[] = {fftLength/2 + 1};
// Distance between the starts of consecutive rows, in elements (not bytes)
int idist = pitch_input_zp/sizeof(float);
int odist = pitch_input_c/sizeof(cufftComplex);

Other parameters remain unchanged.
I wonder whether your problem has been solved by now. If so, how did you solve it?
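
The idea behind the suggestion above, deriving the row distance from the pitch, can be sketched on the host as follows. The struct RowLayout and the function pitchedRowLayout are hypothetical names, and the sketch assumes the pitch reported by cudaMallocPitch is a whole multiple of the element size:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: layout values for batched 1D transforms over pitched rows.
struct RowLayout {
    int embed;   // elements per row that the transform actually uses
    int stride;  // distance between consecutive samples within a row
    int dist;    // distance between the starts of consecutive rows, in elements
};

// Converts a row pitch in bytes into element-based layout values.
RowLayout pitchedRowLayout(int elementsPerRow, std::size_t pitchBytes,
                           std::size_t elementSize) {
    assert(elementsPerRow > 0 && elementSize > 0);
    // Assumption: the pitch is a whole number of elements.
    assert(pitchBytes % elementSize == 0);
    const std::size_t dist = pitchBytes / elementSize;
    // A row must be at least as long as the data the transform reads from it.
    assert(dist >= static_cast<std::size_t>(elementsPerRow));
    return RowLayout{elementsPerRow, 1, static_cast<int>(dist)};
}
```

For example, with a made-up pitch of 512 bytes, a row of floats would get dist = 128 and a row of cufftComplex values (8 bytes each) would get dist = 64. The real pitch is whatever cudaMallocPitch returns on the device in use.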

It has been a year since I worked on this, but yes, I solved it. I can’t explain in detail how I arrived at the solution because I have forgotten :) But from a quick look at the code, this seems to be how I ended up doing it:

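// Contiguous (unpitched) buffers: each row occupies exactly fftlength floats
gpuErrchk(cudaMalloc(&input_d, fftlength*sizeof(float)*rows));
gpuErrchk(cudaMalloc(&outputComplex_d, ((fftlength/2)+1)*sizeof(cufftComplex)*rows));

// This copy reads fftlength floats per row, so the host buffer 'input'
// is presumably already zero-padded to fftlength
gpuErrchk(cudaMemcpy(input_d, input, fftlength*sizeof(float)*rows, cudaMemcpyHostToDevice));

// One batched plan: 'rows' 1D R2C transforms of length fftlength
cufftHandle forwardPlan;
int n[] = {fftlength};
int inembed[] = {0}, onembed[] = {0};   // non-NULL, so the stride/dist values below are used
int idist = fftlength, odist = ((fftlength/2)+1);   // element distance between consecutive rows
cuFFTErrchk(cufftPlanMany(&forwardPlan, 1, n, inembed, 1, idist, onembed, 1, odist, CUFFT_R2C, rows));

cuFFTErrchk(cufftExecR2C(forwardPlan, input_d, outputComplex_d));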

No idea whether it’s a good or particularly efficient solution, but it works :)
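
Since the code above copies fftlength floats per row from the host, the padding presumably happens on the host before the copy. A minimal sketch of that step, assuming a row-major input matrix stored in a std::vector (the function name padRows is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies each row of a row-major rows x inputLength matrix into a
// rows x fftLength matrix, leaving the tail of every row at zero.
std::vector<float> padRows(const std::vector<float>& input, std::size_t rows,
                           std::size_t inputLength, std::size_t fftLength) {
    assert(fftLength >= inputLength);
    assert(input.size() == rows * inputLength);
    std::vector<float> padded(rows * fftLength, 0.0f);  // zero-initialised
    for (std::size_t r = 0; r < rows; ++r) {
        std::copy(input.begin() + r * inputLength,
                  input.begin() + (r + 1) * inputLength,
                  padded.begin() + r * fftLength);
    }
    return padded;
}
```

The resulting buffer is contiguous with exactly fftLength floats per row, which matches idist = fftlength in the plan above, and its data() pointer could be handed to cudaMemcpy.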

It looks like you solved the problem with a one-dimensional array instead of a two-dimensional array.

Glad to see that your problem has been solved. Good luck!