CUDA OVS convolution speed-up


Does anyone have suggestions for speeding up this code?
It is a convolution algorithm using the overlap-save method… I’m using it
in a reverb plugin. The variables passed from the CPU to the device through
the external function contain the following:

a = audio buffer (real-time) / F domain / one block of size 2N / where N = audio buffer size
b = long impulse response / F domain / multiple blocks of size 2N
c = circular buffer / initial condition = empty / stores the convolution result(s) / only used in the device
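
For readers following the layout above, here is a minimal sketch of how a frequency-domain block a, a partitioned impulse response b and a circular buffer c could fit together. It is not the attached code: the kernel name mulAccumulate, the head index, the cuFloatComplex type and the host wrapper accumulateOnce are all assumptions made for illustration. The idea is that partition p of b contributes to the output p blocks in the future, so its product with a is accumulated into slot (head + p) mod P of c.

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <vector>

// Hypothetical kernel: one thread per (frequency bin k, partition p).
// Adds a[k] * b[p][k] to slot (head + p) % numParts of the circular buffer c.
__global__ void mulAccumulate(const cuFloatComplex* a, const cuFloatComplex* b,
                              cuFloatComplex* c, int bins, int numParts, int head)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // frequency bin
    int p = blockIdx.y;                             // impulse response partition
    if (k >= bins || p >= numParts) return;

    int slot = (head + p) % numParts;               // unique per p, so no write races
    int idx  = slot * bins + k;
    cuFloatComplex prod = cuCmulf(a[k], b[p * bins + k]);
    c[idx] = cuCaddf(c[idx], prod);
}

// Host wrapper for trying the kernel out. It allocates on every call and skips
// error checks to stay short; a real plugin would keep the device buffers alive.
std::vector<cuFloatComplex> accumulateOnce(const std::vector<cuFloatComplex>& a,
                                           const std::vector<cuFloatComplex>& b,
                                           std::vector<cuFloatComplex> c,
                                           int numParts, int head)
{
    const int bins = static_cast<int>(a.size());
    const size_t blockBytes = bins * sizeof(cuFloatComplex);
    cuFloatComplex *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, blockBytes);
    cudaMalloc(&dB, numParts * blockBytes);
    cudaMalloc(&dC, numParts * blockBytes);
    cudaMemcpy(dA, a.data(), blockBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), numParts * blockBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, c.data(), numParts * blockBytes, cudaMemcpyHostToDevice);

    const dim3 threads(128);
    const dim3 grid((bins + threads.x - 1) / threads.x, numParts);
    mulAccumulate<<<grid, threads>>>(dA, dB, dC, bins, numParts, head);

    cudaMemcpy(c.data(), dC, numParts * blockBytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

In a scheme like this, slot head would hold the finished spectrum for the current block after the kernel runs; it would then be inverse-transformed, cleared, and head advanced by one for the next audio buffer.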

No unnecessary allocations or memory copies are made, I think. The only array that is continuously
allocated, copied (host to device), copied back (device to host) and freed is the real-time input buffer.

I’ve tried using shared memory (see the commented-out code in the conv kernel…) but it doesn’t make any difference.
My thread blocks are also usually too large to pre-allocate the equivalent shared memory.
I’ve read that pinned memory can speed up CPU-GPU data exchange, but my card doesn’t have any…
I’m really a beginner in GPU stuff… so I think the device code is a bit naive and
can probably be optimized further.
The code is attached. Any suggestions are more than welcome…

Thanks a lot,
Filippos

Hi Filippos,

I looked at your code and I have some questions:

  • The signals you store in the audio buffer a and the long impulse response b are in the frequency domain. How do you compute the FFT? Do you run cuFFT beforehand?

  • In the file, you define three kernels: convolve, overlap and rotate. All of them take the parameter HEIGHT as an input argument, but it is not used. Is it the same as decay?

  • When you say that b = long impulse response, multiple blocks of size 2N, what do you mean? Do you organize the impulse response as a matrix?

If you don’t understand any of the questions, please reply. I am a beginner in CUDA as well and I want to learn.

Thank you



Is your convolution anything like TDFIR or perhaps FDFIR? I’ve done implementations of these, as described here:…sProcessors.pdf

Hi Jimmy,

But in this TDFIR, for example, you multiply the filter with multiple input vectors, not one input vector with multiple filters (the multiple blocks of 2N that make up the long impulse response).

I do not really understand why the long impulse response has multiple blocks of 2N. Shouldn’t it be just one block of 2N elements holding the frequency response of the filter?

Thank you very much…


Ah, actually we did both versions. Different apps wanted slightly different things depending on channel configurations, if I remember correctly.

I’m not sure why you mention 2N. Can you direct me to the page and paragraph, please?

Hi Jimmy,

By 2N I meant this line (in the post from Flag):

b = long impulse response / F domain / multiple blocks of size 2N

My doubt revolves around those multiple blocks of 2*N.

Why does he use multiple blocks for a long impulse response?

Thank you very much,


No idea, I haven’t looked at his code in detail. I don’t have much time… sorry.

Had a quick look, I’m new to CUDA myself, but here’s my 2 cents:

In the convolve kernel I don’t see why you don’t just use registers instead of

	__shared__ Complex s_a;
	__shared__ Complex s_b;
	__shared__ Complex s_c;

as these are only being used by the current thread. Apparently there is some evidence that registers are actually quicker than shared memory (contrary to the programming guide), as the CUFFT library has seen a speed-up from using registers instead of shared memory.
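
A minimal sketch of the register version, under some assumptions: Complex is taken to be a float2 typedef (as in the SDK samples), and the kernel name convolveRegisters and the host wrapper multiplyAccumulate are made up for illustration, not taken from the attached code. Plain local variables are private to each thread and normally end up in registers, whereas a non-array __shared__ variable exists once per thread block, so every thread in the block would be writing to the same copy.

```cuda
#include <cuda_runtime.h>
#include <vector>

typedef float2 Complex;  // assumption: x = real part, y = imaginary part

// Hypothetical pointwise multiply-accumulate: c[i] += a[i] * b[i].
// ra, rb and rc are ordinary locals, so each thread has its own copy.
__global__ void convolveRegisters(const Complex* a, const Complex* b, Complex* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Complex ra = a[i];
    Complex rb = b[i];
    Complex rc = c[i];
    rc.x += ra.x * rb.x - ra.y * rb.y;  // real part of the product
    rc.y += ra.x * rb.y + ra.y * rb.x;  // imaginary part of the product
    c[i] = rc;
}

// Host wrapper for trying the kernel out; error checks omitted to stay short.
std::vector<Complex> multiplyAccumulate(const std::vector<Complex>& a,
                                        const std::vector<Complex>& b,
                                        std::vector<Complex> c)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(Complex);
    Complex *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, c.data(), bytes, cudaMemcpyHostToDevice);

    convolveRegisters<<<(n + 127) / 128, 128>>>(dA, dB, dC, n);

    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```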

Also, shouldn’t the conversion of the buffers to and from the frequency domain be done on the GPU as well?

In the overlap kernel, all threads do a division to produce a constant value q, even threads that don’t actually do any work.
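
One way this could look, as a sketch only: I’m assuming q is a normalization factor such as 1/(2N) (cuFFT transforms are unnormalized, so a forward transform followed by an inverse one scales the data by the transform length) and that the kernel keeps the last N samples of each 2N block, as overlap-save does. The names keepValidHalf, invScale and validHalf are hypothetical. The constant is computed once on the host and passed in, and idle threads return before doing any arithmetic.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical overlap-save output step: keep the last n samples of a 2n-sample
// time-domain block and scale them by a constant computed on the host.
__global__ void keepValidHalf(const float* timeBlock, float* out, int n, float invScale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                    // idle threads leave before any arithmetic
    out[i] = timeBlock[n + i] * invScale;  // one multiply per thread, no division
}

// Host wrapper for trying the kernel out; error checks omitted to stay short.
std::vector<float> validHalf(const std::vector<float>& timeBlock)
{
    const int n = static_cast<int>(timeBlock.size() / 2);
    const float invScale = 1.0f / (2.0f * n);  // computed once, on the host

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, 2 * n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, timeBlock.data(), 2 * n * sizeof(float), cudaMemcpyHostToDevice);

    keepValidHalf<<<(n + 127) / 128, 128>>>(dIn, dOut, n, invScale);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```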

In the host-side code, you might consider moving setDevice into a separate function; surely it only needs to be called once.

It might also be worthwhile to allocate ‘permanent’ buffers for a, b and c; you could just make them extra large. Allocating takes a long time! Additionally, memory transfers are better off being combined into one, rather than this left/right business. Can I assume these are coming from something like processReplacing(float**,float**,sampleFrames)? It’s still worth sticking them together, as memory transfers are quicker for larger buffers for some reason. Have a quick check with bandwidthTest.exe in the NVIDIA SDK if you don’t believe me. By the way, this sample program is seriously flawed unless you increase the number of iterations (#define NUM_ITERS, if I remember correctly), as this is set to only 10. I had some issues with this.
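
To make the host-side suggestions concrete, here is a rough sketch under stated assumptions. The StereoIo struct and the functions initStereoIo, processStereo and destroyStereoIo are hypothetical, and placeholderGain only stands in for the real convolution kernels. The device is selected and the buffers are allocated once; each audio callback then packs left and right into one staging buffer and does a single copy in each direction. The staging buffer is page-locked host memory from cudaMallocHost; whether that helps on a given machine is something to measure.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical state kept alive for the lifetime of the plugin.
struct StereoIo {
    float* hostStage = nullptr;  // page-locked host buffer laid out as [left | right]
    float* devBuf    = nullptr;  // device buffer with the same layout
    int    maxFrames = 0;
};

// Stand-in for the real processing kernels.
__global__ void placeholderGain(float* buf, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= gain;
}

// Called once: select the device and allocate buffers large enough for any block.
bool initStereoIo(StereoIo& io, int maxFrames)
{
    if (maxFrames <= 0 || cudaSetDevice(0) != cudaSuccess) return false;
    const size_t bytes = 2 * static_cast<size_t>(maxFrames) * sizeof(float);
    if (cudaMallocHost(&io.hostStage, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&io.devBuf, bytes) != cudaSuccess) {
        cudaFreeHost(io.hostStage);
        io.hostStage = nullptr;
        return false;
    }
    io.maxFrames = maxFrames;
    return true;
}

// Called per audio block: no allocation, one transfer each way.
bool processStereo(StereoIo& io, const float* left, const float* right,
                   float* outLeft, float* outRight, int frames)
{
    if (frames <= 0 || frames > io.maxFrames) return false;
    const size_t half = frames * sizeof(float);
    std::memcpy(io.hostStage, left, half);
    std::memcpy(io.hostStage + frames, right, half);
    if (cudaMemcpy(io.devBuf, io.hostStage, 2 * half, cudaMemcpyHostToDevice) != cudaSuccess)
        return false;

    const int n = 2 * frames;
    placeholderGain<<<(n + 127) / 128, 128>>>(io.devBuf, n, 0.5f);

    // A blocking cudaMemcpy waits for the preceding kernel to finish.
    if (cudaMemcpy(io.hostStage, io.devBuf, 2 * half, cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;
    std::memcpy(outLeft, io.hostStage, half);
    std::memcpy(outRight, io.hostStage + frames, half);
    return true;
}

// Called once at shutdown.
void destroyStereoIo(StereoIo& io)
{
    cudaFree(io.devBuf);
    cudaFreeHost(io.hostStage);
    io = StereoIo();
}
```

In a VST-style plugin, initStereoIo would go wherever the plugin is set up, and processStereo would be called from processReplacing with inputs[0], inputs[1], outputs[0] and outputs[1].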

Incidentally, I think I’m trying to do exactly the same thing as you. I’m currently working on a VST effect and basically using the GPU to do all the convolving. It hasn’t been easy so far, but fun! =)