Does anyone have any suggestions on how to speed up this code?
It is a convolution algorithm using the overlap-save method; I'm using it
in a reverb plugin. The variables passed to the device from the CPU through
the external function contain the following:
a = audio buffer (real-time) / frequency domain / one block of size 2N, where N = audio buffer size
b = long impulse response / frequency domain / multiple blocks of size 2N
c = circular buffer / initial condition = empty / stores the convolution result(s) / only used on the device
As far as I can tell, no unnecessary allocations or memory copies are made. The only array that is
continuously allocated, copied (host-to-device), copied back (device-to-host), and freed is the real-time input buffer.
I've tried using shared memory (see the commented code in the conv kernel…) but it doesn't make any difference.
My thread blocks are also usually too large to pre-allocate an equivalent amount of shared memory.
I’ve read that pinned memory can speed up CPU-GPU data exchange but my card doesn’t have any…
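One note for readers: pinned (page-locked) memory is allocated in host RAM, not on the card, so any CUDA-capable system can use it. A minimal sketch of allocating the real-time input buffer once as pinned memory (buffer size and names hypothetical):

```cuda
#include <cuda_runtime.h>

static float *h_in = NULL;   /* host-side, page-locked */
static float *d_in = NULL;   /* device-side */
static const size_t bytes = 2 * 512 * sizeof(float); /* 2N, N = 512 (hypothetical) */

void buffers_init(void)
{
    /* cudaMallocHost gives pinned memory instead of pageable malloc();
       host-to-device copies from it can use DMA and are typically faster */
    cudaMallocHost((void **)&h_in, bytes);
    cudaMalloc((void **)&d_in, bytes);
}

void buffers_upload(void)
{
    /* per audio callback: fill h_in, then one transfer */
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
}

void buffers_shutdown(void)
{
    cudaFreeHost(h_in);   /* pinned memory must be freed with cudaFreeHost */
    cudaFree(d_in);
}
```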
I'm really a beginner at GPU stuff, so I think the device code is a bit naive and can
probably be optimized further.
The code is attached. Any suggestions are more than welcome…
You could keep per-thread intermediate values in local variables (registers) rather than shared memory, as these are only being used by the current thread. Apparently there is some evidence that registers are actually quicker than shared memory (contrary to the programming guide): the CUFFT library saw a speed-up by using registers instead of shared memory.
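What I mean, sketched for the frequency-domain multiply a convolution kernel would do (kernel and variable names are hypothetical):

```cuda
/* Complex multiply-accumulate for frequency-domain convolution.
   The re/im intermediates are per-thread locals, so the compiler keeps
   them in registers; there is no need to stage them through __shared__
   memory, which only pays off when threads share data. */
__global__ void cmul_acc(const float2 *a, const float2 *b,
                         float2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 x = a[i], y = b[i];
    float re = x.x * y.x - x.y * y.y;   /* register, not __shared__ */
    float im = x.x * y.y + x.y * y.x;   /* register, not __shared__ */
    c[i].x += re;
    c[i].y += im;
}
```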
Also, shouldn’t converting the buffers to and from the frequency domain be done on the GPU also?
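CUFFT can do that transform directly on the device, so the audio never has to come back to the CPU just to be converted. A rough sketch (fftSize and the buffer names are my assumptions; in a real-time plugin you would create the plan once at startup, not per block):

```cuda
#include <cufft.h>

/* Forward FFT of one 2N-sample block, entirely on the device. */
void forward_fft(cufftReal *d_time, cufftComplex *d_freq, int fftSize)
{
    cufftHandle plan;
    cufftPlan1d(&plan, fftSize, CUFFT_R2C, 1); /* real-to-complex, 1 batch */
    cufftExecR2C(plan, d_time, d_freq);
    cufftDestroy(plan);
}
```

Plan creation is expensive, so hoisting cufftPlan1d/cufftDestroy out of the audio path and reusing one plan for every block is the way to go.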
In the overlap kernel, all threads do a division to produce a constant value q, even threads that don't actually do any work.
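One way around that (I'm guessing at your kernel's shape, so the names here are hypothetical): do the division once on the host and pass the result in as a kernel argument, so no thread divides at all:

```cuda
/* Before: every thread computed e.g. float q = 1.0f / (float)fftSize;
   Instead, compute q once on the host and pass it as an argument. */
__global__ void overlap(const float *c, float *out, int n, float q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 /* threads past n now do nothing at all */
        out[i] = c[i] * q;     /* scale by the precomputed constant    */
}

/* host side:
   float q = 1.0f / (float)fftSize;
   overlap<<<blocks, threads>>>(d_c, d_out, n, q);
*/
```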
On the host side, you might consider creating a separate function for setDevice; surely this should only be called once.
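Something like this, a sketch of a one-off init guard (function name and flag are hypothetical):

```cuda
#include <cuda_runtime.h>

static int gpu_initialised = 0;

/* Call once at plugin load, not in the per-buffer processing path;
   selecting the device is a one-off cost. */
void gpu_init(void)
{
    if (!gpu_initialised) {
        cudaSetDevice(0);
        gpu_initialised = 1;
    }
}
```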
It might also be worthwhile allocating 'permanent' device buffers for a, b, and c; you could just make them extra large. Allocating takes
a long time! Additionally, memory transfers are better off being combined into one large copy rather than this separate left/right business. Can I assume
these are coming from something like processReplacing(float**, float**, sampleFrames)? It's still worth packing them together,
as memory transfers are quicker for larger buffers: the fixed per-transfer overhead gets amortized. Have a quick check with bandwidthTest.exe in the NVIDIA SDK if you don't believe me. BTW, that sample program is seriously flawed unless you increase the number of iterations (#define NUM_ITERS, if I remember correctly), as it is set to only 10. I had some issues with this.
Incidentally, I think I'm trying to do exactly the same thing as you. I'm currently working on a VST effect, basically using the GPU to do all the convolving. It hasn't been easy so far, but fun! =)