Does anyone have any suggestions on how to speed up this code?
It is a convolution algorithm using the overlap-save method; I'm using it
in a reverb plugin. The variables passed to the device from the CPU through
the external function contain the following:
a = audio buffer (real-time) / frequency domain / one block of size 2N, where N = audio buffer size
b = long impulse response / frequency domain / multiple blocks of size 2N
c = circular buffer / initial condition = empty / stores the convolution result(s) / only used on the device
As far as I can tell, no unnecessary allocations or memory copies are made. The only array that is
continuously allocated, copied (host-to-device), copied back (device-to-host), and freed is the real-time input buffer.
I've tried using shared memory (see the commented code in the conv kernel…) but it doesn't make any difference.
My thread blocks are also usually too large to pre-allocate an equivalent amount of shared memory.
I’ve read that pinned memory can speed up CPU-GPU data exchange but my card doesn’t have any…
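One note for readers: pinned (page-locked) memory is allocated in host RAM, not on the card, so any CUDA-capable system can use it. A minimal sketch of allocating the real-time input buffer once as pinned memory (buffer size and names hypothetical):

```cuda
#include <cuda_runtime.h>

static float *h_in = NULL;   /* host-side, page-locked */
static float *d_in = NULL;   /* device-side */
static const size_t bytes = 2 * 512 * sizeof(float); /* 2N, N = 512 (hypothetical) */

void buffers_init(void)
{
    /* cudaMallocHost gives pinned memory instead of pageable malloc();
       host-to-device copies from it can use DMA and are typically faster */
    cudaMallocHost((void **)&h_in, bytes);
    cudaMalloc((void **)&d_in, bytes);
}

void buffers_upload(void)
{
    /* per audio callback: fill h_in, then one transfer */
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
}

void buffers_shutdown(void)
{
    cudaFreeHost(h_in);   /* pinned memory must be freed with cudaFreeHost */
    cudaFree(d_in);
}
```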
I'm really a beginner at GPU stuff, so I think the device code is a bit naive and can
probably be optimized further.
The code is attached. Any suggestions are more than welcome…
You could keep per-thread intermediate values in local variables (registers) rather than shared memory, as these are only being used by the current thread. Apparently there is some evidence that registers are actually quicker than shared memory (contrary to the programming guide): the CUFFT library saw a speed-up by using registers instead of shared memory.
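What I mean, sketched for the frequency-domain multiply a convolution kernel would do (kernel and variable names are hypothetical):

```cuda
/* Complex multiply-accumulate for frequency-domain convolution.
   The re/im intermediates are per-thread locals, so the compiler keeps
   them in registers; there is no need to stage them through __shared__
   memory, which only pays off when threads share data. */
__global__ void cmul_acc(const float2 *a, const float2 *b,
                         float2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 x = a[i], y = b[i];
    float re = x.x * y.x - x.y * y.y;   /* register, not __shared__ */
    float im = x.x * y.y + x.y * y.x;   /* register, not __shared__ */
    c[i].x += re;
    c[i].y += im;
}
```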
Also, shouldn’t converting the buffers to and from the frequency domain be done on the GPU also?
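CUFFT can do that transform directly on the device, so the audio never has to come back to the CPU just to be converted. A rough sketch (fftSize and the buffer names are my assumptions; in a real-time plugin you would create the plan once at startup, not per block):

```cuda
#include <cufft.h>

/* Forward FFT of one 2N-sample block, entirely on the device. */
void forward_fft(cufftReal *d_time, cufftComplex *d_freq, int fftSize)
{
    cufftHandle plan;
    cufftPlan1d(&plan, fftSize, CUFFT_R2C, 1); /* real-to-complex, 1 batch */
    cufftExecR2C(plan, d_time, d_freq);
    cufftDestroy(plan);
}
```

Plan creation is expensive, so hoisting cufftPlan1d/cufftDestroy out of the audio path and reusing one plan for every block is the way to go.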
In the overlap kernel, all threads do a division to produce a constant value q, even threads that don't actually do any work.
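One way around that (I'm guessing at your kernel's shape, so the names here are hypothetical): do the division once on the host and pass the result in as a kernel argument, so no thread divides at all:

```cuda
/* Before: every thread computed e.g. float q = 1.0f / (float)fftSize;
   Instead, compute q once on the host and pass it as an argument. */
__global__ void overlap(const float *c, float *out, int n, float q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 /* threads past n now do nothing at all */
        out[i] = c[i] * q;     /* scale by the precomputed constant    */
}

/* host side:
   float q = 1.0f / (float)fftSize;
   overlap<<<blocks, threads>>>(d_c, d_out, n, q);
*/
```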
On the host side, you might consider creating a separate function for setDevice; surely this should only be called once.
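Something like this, a sketch of a one-off init guard (function name and flag are hypothetical):

```cuda
#include <cuda_runtime.h>

static int gpu_initialised = 0;

/* Call once at plugin load, not in the per-buffer processing path;
   selecting the device is a one-off cost. */
void gpu_init(void)
{
    if (!gpu_initialised) {
        cudaSetDevice(0);
        gpu_initialised = 1;
    }
}
```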
It might also be worthwhile allocating 'permanent' device buffers for a, b, and c; you could just make them extra large. Allocating takes
a long time! Additionally, memory transfers are better off being combined into one large copy rather than this separate left/right business. Can I assume
these are coming from something like processReplacing(float**, float**, sampleFrames)? It's still worth packing them together,
as memory transfers are quicker for larger buffers: the fixed per-transfer overhead gets amortized. Have a quick check with bandwidthTest.exe in the NVIDIA SDK if you don't believe me. BTW, that sample program is seriously flawed unless you increase the number of iterations (#define NUM_ITERS, if I remember correctly), as it is set to only 10. I had some issues with this.
Incidentally, I think I'm trying to do exactly the same thing as you. I'm currently working on a VST effect, basically using the GPU to do all the convolving. It hasn't been easy so far, but fun! =)