Optimizing a convolution algorithm

Hi there, I'd like to ask the community for ideas on how to optimize a simple convolution algorithm. Here is the CPU code:

[codebox]void convolve_cpu( const float* a, const float* b, float* c, const unsigned int src_n )
{
    // c must point to 2 * src_n - 1 zero-initialized elements
    for( unsigned int i = 0; i < src_n; ++i )
    {
        for( unsigned int j = 0; j < src_n; ++j )
        {
            c[ i + j ] += a[i] * b[j];
        }
    }
}
[/codebox]

There are two input float arrays ( a and b ) which are convolved into an output array ( c ). src_n is the length of each input array; the output array has length 2 * src_n - 1.
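For reference, here is a minimal host-side sketch of how I call it (the values are just made up for illustration):

[codebox]// minimal usage sketch; values are made up for illustration
const unsigned int src_n = 4;
float a[] = { 1.f, 2.f, 3.f, 4.f };
float b[] = { 1.f, 1.f, 1.f, 1.f };
float c[7] = { 0.f };   // dst_n = 2 * src_n - 1 = 7, zero-initialized

convolve_cpu( a, b, c, src_n );
// c is now { 1, 3, 6, 10, 9, 7, 4 }
[/codebox]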

For my project we can safely assume that all data, input and output, will fit into shared memory on my Tesla C2050 (48 KB per streaming multiprocessor). Also, I'm dealing with lots of convolutions all the time, but the number of points per convolution is relatively small ( 0 - 500 ). That's why I decided that one convolution should be handled by a single streaming multiprocessor. Does that make sense?
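To show the arithmetic behind the shared memory buffer sizes you'll see in the kernel below:

[codebox]// sizing of the shared buffers (48 KB per SM on the C2050):
//   2 input buffers : 2 * 0xc00  floats = 2 * 3072 * 4 bytes = 24576 bytes
//   1 output buffer :     0x1800 floats =     6144 * 4 bytes = 24576 bytes
//   total                                                    = 49152 bytes = 48 KB
// 0x1800 = 2 * 0xc00, so the output buffer covers dst_n = 2 * src_n - 1
// for any src_n up to 3072 -- far more than my 500-point maximum.
[/codebox]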

Now, my implementation for the GPU looks quite different since I try to parallelize the calculation as much as possible. Let me just show the code before I go into more details.

[codebox]__global__ void convolve_gpu_case_2( const float_t*     src_1
                                    , const float_t*     src_2
                                    , float_t*           dst
                                    , const unsigned int src_n
                                    , const unsigned int dst_n
                                    , const unsigned int threads_n
                                    )
{
    // 2 * 0xc00 + 0x1800 floats = 48 KB, the full shared memory of one SM
    __shared__ float_t src_1_temp[ 0xc00  ];
    __shared__ float_t src_2_temp[ 0xc00  ];
    __shared__ float_t dst_temp  [ 0x1800 ];

    // with a one-block launch this is just threadIdx.x
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // 1. Copy sources to shared memory
    unsigned int index = tid;
    while( index < src_n )
    {
        src_1_temp[index] = src_1[index];
        index += threads_n;
    }

    index = tid;
    while( index < src_n )
    {
        src_2_temp[index] = src_2[index];
        index += threads_n;
    }

    // Overwrite stale values left over in shared memory
    index = tid;
    while( index < dst_n )
    {
        dst_temp[index] = 0.f;
        index += threads_n;
    }

    __syncthreads();

    // 2. Convolve: each thread computes one (or more) output elements,
    //    i.e. one anti-diagonal of the implicit product matrix
    index = tid;
    while( index < dst_n )
    {
        unsigned int num_operations = ( index < src_n ) ? index + 1 : dst_n - index;
        unsigned int index_src_1    = ( index < src_n ) ?         0 : index - src_n + 1;
        unsigned int index_src_2    = ( index < src_n ) ?     index :         src_n - 1;

        for( unsigned int j = 0; j < num_operations; ++j )
        {
            dst_temp[index] += src_1_temp[index_src_1] * src_2_temp[index_src_2];
            index_src_1++;
            index_src_2--;
        }

        index += threads_n;
    }

    __syncthreads();

    // 3. Copy the result back to global memory
    index = tid;
    while( index < dst_n )
    {
        dst[index] = dst_temp[index];
        index += threads_n;
    }
}[/codebox]

The first step in my implementation is to fill the shared memory with the two input arrays. I also have to reset the output buffer in shared memory to zero; shared memory isn't zero-initialized, so leftover values mess up my results otherwise.

The next step is a little harder to describe. As you can see in the CPU implementation, there are two nested for loops, and the inner loop repeatedly adds a value to one element of the output array. This introduces a data dependence: several threads could be reading and writing the same output element at the same time. To avoid that, one could create a temporary square matrix where each row is filled by one pass of the inner loop, without the accumulation. Once that is done, the output array can be computed by summing the matrix along its anti-diagonals (a rough kernel sketch follows the example). For instance:

output[0] = mat[0,0]

output[1] = mat[1,0] + mat[0,1]

output[2] = mat[2,0] + mat[1,1] + mat[0,2]

output[3] = mat[3,0] + mat[2,1] + mat[1,2] + mat[0,3]

etc.
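Here is a rough kernel sketch of that first matrix-based attempt (names are hypothetical; I cap n at 64 so the n-by-n matrix fits in shared memory, purely for illustration):

[codebox]__global__ void convolve_gpu_matrix( const float* a, const float* b, float* c, const unsigned int n )
{
    __shared__ float mat[64][64];   // 16 KB; only valid for n <= 64

    // fill mat[i][j] = a[i] * b[j], one entry per thread per step
    for( unsigned int k = threadIdx.x; k < n * n; k += blockDim.x )
    {
        mat[ k / n ][ k % n ] = a[ k / n ] * b[ k % n ];
    }

    __syncthreads();

    // each thread sums one anti-diagonal: c[d] = sum of mat[i][d - i]
    for( unsigned int d = threadIdx.x; d < 2 * n - 1; d += blockDim.x )
    {
        float sum = 0.f;
        unsigned int i_lo = ( d < n ) ? 0 : d - n + 1;
        unsigned int i_hi = ( d < n ) ? d : n - 1;

        for( unsigned int i = i_lo; i <= i_hi; ++i )
        {
            sum += mat[i][ d - i ];
        }

        c[d] = sum;
    }
}[/codebox]

Launched with a single block, no two threads ever write the same matrix entry or the same output element, which sidesteps the data dependence.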

This basically sums up my first attempt to optimize my convolution algorithm. The next iteration goes a little further.

Here, I don't allocate the matrix anymore; instead, each thread walks one anti-diagonal of that (now implicit) matrix (a launch sketch follows the example). For instance:

thread 0: output[0] = src1[0] * src2[0]

thread 1: output[1] = src1[1] * src2[0] + src1[0] * src2[1]

thread 2: output[2] = src1[2] * src2[0] + src1[1] * src2[1] + src1[0] * src2[2]

thread 3: output[3] = src1[3] * src2[0] + src1[2] * src2[1] + src1[1] * src2[2] + src1[0] * src2[3]
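For completeness, this is roughly how I launch the kernel for a single convolution (one block, so it runs on a single SM; d_src_1, d_src_2, and d_dst are float_t device pointers I allocate with cudaMalloc beforehand):

[codebox]const unsigned int src_n     = 500;             // points per convolution
const unsigned int dst_n     = 2 * src_n - 1;
const unsigned int threads_n = 256;             // one block = one SM

convolve_gpu_case_2<<< 1, threads_n >>>( d_src_1, d_src_2, d_dst,
                                         src_n, dst_n, threads_n );
cudaDeviceSynchronize();
[/codebox]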

I hope I described my solution sufficiently. Just ask if you couldn't follow something.

I’m interested to see if you might have an even better solution to my problem. Any suggestions are welcome.

I’m using a Tesla C2050 card.

Regards,

Christian