Hello,
I've been trying to write a real-time VST impulse-response reverb plug-in that uses cuFFT for the FFT transforms. I've managed to make it work with a 1-dimensional plan,
but it takes quite a while and I get a CPU load in the range of 30-80%, depending on the impulse
response (IR) array size.
I've seen that 2-dimensional plans take much less time, so I tried to implement one. The problem is that I'm getting degraded output; it sounds as if the bit rate of the wav file had been reduced.
I believe it might be caused by the way I normalize the convolved signal, but I can't figure out what's wrong. There is no error at all if the buffers are not convolved but simply copied back to the host, or if they are convolved using a 1-D cuFFT plan.
I've been over every available online tutorial, and here are the device code and the convolution kernel I've ended up with. (Note that the same code works fine for a 1-D transform, but it loads the CPU a lot.)
"external_VST_CUDA.cu":
[codebox]
/*
This function is called from the plug-in through a worker thread, so the CPU
waits until the GPU has finished; that delays the whole audio playback and
results in a high CPU load. (Although if the GPU takes only about 10 ms or
less per iteration, the playback is not affected at all, or at least that is
what it sounds like, and the CPU load goes down very quickly.) So the quicker
the better! Ultimately I'd like to try a 3-D plan if it is even quicker.

N = the calculated FFT size. Both signals have already been padded with zeros
up to size N in the VST CPU code and are ready for transformation and
convolution. N is always of the form
NextPowerOf2(Signal_length + Filter_length - 1).

Copying from host to device is done in a kernel, through textures and arrays,
into the device variables "padded_A" and "padded_B"; that part is not shown
here. */
extern "C" void
run_VST_CUDA_code(int N, Complex* a, Complex* b)
{
    int nBlocks   = N / 128;
    int nThreadsT = N / nBlocks;
    int nThreadsx, nThreadsy;
    // Pick a 2-D block shape; not very sure about these thresholds.
    if      (nBlocks >= 8192) nThreadsy = 256;
    else if (nBlocks >= 2048) nThreadsy = 128;
    else if (nBlocks >= 128)  nThreadsy = 64;
    else if (nBlocks >= 64)   nThreadsy = 32;
    else if (nBlocks >= 32)   nThreadsy = 16;
    else if (nBlocks >= 16)   nThreadsy = 8;
    else if (nBlocks >= 8)    nThreadsy = 4;
    else if (nBlocks >= 4)    nThreadsy = 2;
    else                      nThreadsy = 1;
    nThreadsx = nThreadsT / nThreadsy;
    dim3 BlockDim(nThreadsx, nThreadsy, 1);
    dim3 GridDim(nThreadsT / BlockDim.x, nBlocks / BlockDim.y, 1);
cufftHandle plan;
cufftSafeCall(cufftPlan2d(&plan, nBlocks, nThreadsT, CUFFT_C2C));
// Both padded_A and padded_B have size N
cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_A,
(cufftComplex *)padded_A, CUFFT_FORWARD));
cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_B,
(cufftComplex *)padded_B, CUFFT_FORWARD));
cudaThreadSynchronize();
    Convolve_and_Normalize<<<GridDim, BlockDim>>>(padded_A, padded_B, N, nThreadsT, nBlocks);
cufftSafeCall(cufftExecC2C(plan, (cufftComplex*)padded_A,
(cufftComplex*)padded_A, CUFFT_INVERSE));
cudaThreadSynchronize();
    // Copy results to the CPU...
    // Free arrays...
    // Destroy plan...
    // Not calling cudaThreadExit, to avoid having to reinitialize the device
    // for each audio buffer that is sent from the host.
    // Outputs are saved in the plug-in's global memory. Addition of the
    // overlapping parts is implemented in the processReplacing() function of the VST.
}
[/codebox]
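For reference, the FFT size N mentioned in the comment above can be computed on the CPU along these lines. Only the name NextPowerOf2 appears in my code; the body below is just a sketch of it:

```cpp
#include <cstddef>

// Sketch of the FFT-size calculation described in the comment above:
// N = NextPowerOf2(Signal_length + Filter_length - 1).
static size_t NextPowerOf2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;  // double until we reach or pass n
    return p;
}

static size_t PaddedFFTSize(size_t signalLen, size_t filterLen)
{
    // A linear convolution of the two signals needs at least
    // signalLen + filterLen - 1 samples; round up to a power of two.
    return NextPowerOf2(signalLen + filterLen - 1);
}
```

For example, a 300-sample buffer convolved with a 200-tap IR needs 499 samples, which rounds up to N = 512.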
And here is the kernel code.
"external_VST_CUDA_kernel.cu":
[codebox]
typedef float2 Complex;  // assumed: same x/y layout as cufftComplex

#define IMUL(a, b) __mul24(a, b)

__device__ void complexMulAndScale(Complex& a, Complex b, float c)
{
Complex t = {c * (a.x * b.x - a.y * b.y), c * (a.y * b.x + a.x * b.y)};
a = t;
}
__global__ void Convolve_and_Normalize
(Complex *d_PaddedData, Complex *d_PaddedKernel, int dataN, int width, int height)
{
const int x = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int y = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const int tid = y*width + x;
const int threadNx = IMUL(blockDim.x, gridDim.x);
const int threadNy = IMUL(blockDim.y, gridDim.y);
const int threadN = threadNx * threadNy;
    const float q = 1.f / (float)dataN;
    for (int i = tid; i < dataN; i += threadN) {
        complexMulAndScale(d_PaddedData[i], d_PaddedKernel[i], q);
    }
}
[/codebox]
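In case the normalization really is the culprit, here is a plain CPU copy of the kernel's per-element arithmetic that the GPU output can be diffed against. ComplexRef is my stand-in for the plug-in's Complex type, assumed to match cufftComplex's x/y layout:

```cpp
#include <cmath>

// CPU stand-in for the plug-in's Complex type (assumed float2-like,
// x = real part, y = imaginary part, as in cufftComplex).
struct ComplexRef { float x, y; };

// Same arithmetic as the complexMulAndScale device function above:
// a <- c * (a * b), a complex multiply followed by a scale by c = 1/dataN.
static void complexMulAndScaleRef(ComplexRef& a, ComplexRef b, float c)
{
    ComplexRef t = { c * (a.x * b.x - a.y * b.y),
                     c * (a.y * b.x + a.x * b.y) };
    a = t;
}
```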
Can anyone help with this, please? Is there anything wrong with the way I retrieve the thread index, or with
the 2-D plan and its x and y dimensions? Would a 3-D plan be faster for sizes in the range of 50k-1M elements per FFT?
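PS: in case it helps anyone poke at the launch dimensions without a GPU, here is a standalone CPU copy of the if-chain from run_VST_CUDA_code, so you can check that the grid covers exactly N elements:

```cpp
#include <cassert>

// Standalone copy of the launch-dimension logic in run_VST_CUDA_code,
// so the block/grid split can be sanity-checked on the CPU.
static void launchDims(int N, int& nBlocks, int& nThreadsT,
                       int& nThreadsx, int& nThreadsy)
{
    nBlocks   = N / 128;
    nThreadsT = N / nBlocks;  // 128 whenever N is a multiple of 128
    if      (nBlocks >= 8192) nThreadsy = 256;
    else if (nBlocks >= 2048) nThreadsy = 128;
    else if (nBlocks >= 128)  nThreadsy = 64;
    else if (nBlocks >= 64)   nThreadsy = 32;
    else if (nBlocks >= 32)   nThreadsy = 16;
    else if (nBlocks >= 16)   nThreadsy = 8;
    else if (nBlocks >= 8)    nThreadsy = 4;
    else if (nBlocks >= 4)    nThreadsy = 2;
    else                      nThreadsy = 1;
    nThreadsx = nThreadsT / nThreadsy;
}
```

The total thread count is GridDim.x * BlockDim.x * GridDim.y * BlockDim.y = (nThreadsT / nThreadsx) * nThreadsx * (nBlocks / nThreadsy) * nThreadsy, which should come out to exactly N when N is a power of two.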
Thanks for looking,
Filippos