Convolution with cufftPlan2d: errors in output (GPU audio reverb plugin, cuFFT, cufftPlan2d)


I've been trying to write a real-time VST impulse-response reverb plug-in using cuFFT for the FFT transforms. I've managed to make it work with a one-dimensional plan, but it takes quite a while and I get a CPU load in the range of 30-80%, depending on the impulse response (IR) array size.

I've seen that two-dimensional plans take much less time, so I tried to implement one. The problem is that I'm getting degraded output; it sounds as if you were decreasing the bit depth of a given WAV file.

I believe it might be because of the way I normalize the convolved signal, but I can't figure out what's wrong. There is no error if the buffers are not convolved but just copied and sent back to the host, or if they are convolved using a 1D cuFFT plan.

I've been over every available online tutorial, and here are the device code and the convolution kernel I've ended up with. (Note that the same code works fine for a 1D transform, but loads the CPU a lot.)




This function is called in the plug-in through a worker thread, so the CPU waits until the GPU has finished, the whole audio playback is delayed, and the result is a high CPU load(?). However, if the GPU takes only about 10 ms or less per iteration, playback is not affected at all (or at least that's what it sounds like) and the CPU load drops very quickly. So the quicker the better! Ultimately I'd like to have maybe a 3D plan, if that's even quicker.

/* N = calculated FFT size. Both signals have already been padded with zeros
   up to size N in the VST CPU code and are ready for transformation and
   convolution. N is always of the form NextPowerOf2(Signal_length + Filter_length - 1).
   Copying from host to device is done in a kernel through textures and arrays
   to the device variables "padded_A" and "padded_B", but this is not shown here. */

extern "C" void
run_VST_CUDA_code(int N, Complex* a, Complex* b)
{
    int nBlocks   = N / 128;
    int nThreadsT = N / nBlocks;
    int nThreadsx, nThreadsy;

    // Not very sure about these if statements...
    if (nBlocks >= 8192) { nThreadsx = nThreadsT / 256; nThreadsy = 256; }
    if (nBlocks <  8192) { nThreadsx = nThreadsT / 128; nThreadsy = 128; }
    if (nBlocks <  2048) { nThreadsx = nThreadsT / 64;  nThreadsy = 64;  }
    if (nBlocks <  128)  { nThreadsx = nThreadsT / 32;  nThreadsy = 32;  }
    if (nBlocks <  64)   { nThreadsx = nThreadsT / 16;  nThreadsy = 16;  }
    if (nBlocks <  32)   { nThreadsx = nThreadsT / 8;   nThreadsy = 8;   }
    if (nBlocks <  16)   { nThreadsx = nThreadsT / 4;   nThreadsy = 4;   }
    if (nBlocks <  8)    { nThreadsx = nThreadsT / 2;   nThreadsy = 2;   }
    if (nBlocks <  4)    { nThreadsx = nThreadsT / 1;   nThreadsy = 1;   }

    dim3 BlockDim(nThreadsx, nThreadsy, 1);
    dim3 GridDim(nThreadsT / BlockDim.x, nBlocks / BlockDim.y, 1);

    cufftHandle plan;
    cufftSafeCall(cufftPlan2d(&plan, nBlocks, nThreadsT, CUFFT_C2C));

    // Both padded_A and padded_B have size N.
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_A,
                               (cufftComplex*)padded_A, CUFFT_FORWARD));
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_B,
                               (cufftComplex*)padded_B, CUFFT_FORWARD));

    Convolve_and_Normalize<<<GridDim, BlockDim>>>(padded_A, padded_B, N, nThreadsT, nBlocks);

    cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_A,
                               (cufftComplex*)padded_A, CUFFT_INVERSE));

    // Copy results to CPU...
    // Free arrays...
    // Destroy plan...
    // Not calling cudaThreadExit to avoid having to reinitialize the device for
    // each audio buffer that is being sent from the host.
    // Outputs are saved in the plugin's global memory. Addition of the overlapping
    // parts is implemented in the processReplacing() function of the VST.
}



And here is the kernel code.



#define IMUL(a, b) __mul24(a, b)

__device__ void complexMulAndScale(Complex& a, Complex b, float c)
{
    Complex t = {c * (a.x * b.x - a.y * b.y), c * (a.y * b.x + a.x * b.y)};
    a = t;
}

__global__ void Convolve_and_Normalize
    (Complex* d_PaddedData, Complex* d_PaddedKernel, int dataN, int width, int height)
{
    const int x   = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int y   = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
    const int tid = y * width + x;

    const int threadNx = IMUL(blockDim.x, gridDim.x);
    const int threadNy = IMUL(blockDim.y, gridDim.y);
    const int threadN  = threadNx * threadNy;

    const float q = 1.f / (float)dataN;

    for (int i = tid; i < dataN; i += threadN) {
        complexMulAndScale(d_PaddedData[i], d_PaddedKernel[i], q);
    }
}



Can anyone help with this, please? Is there anything wrong with the way I retrieve the thread index, or with the 2D plan and its x and y dimensions? Would a 3D plan be faster for sizes in the range of 50k-1M elements per FFT?

Thanks for looking,