Hi there, I like to ask the community for ideas on how to optimize a simple convolution algorithm. Here is the CPU code:

[codebox]void convolve_cpu( const float* a, const float* b, float* c, const unsigned int src_n )

{

```
for( unsigned int i = 0; i < src_n; ++i )
{
for( unsigned int j = 0; j < src_n; ++j )
{
c[ i + j ] += a[i] * b[j];
}
}
```

}

[/codebox]

There are two input float arrays ( a and b ) which are convolved into an output array ( c ). The length of the output array is 2 * src_n - 1. src_n is the length of an input array.

For my project we can easily assume that all data, input and output, will fit into shared memory on my Tesla C2050. It’s 48KB per Streaming Multiprocessor. Also, I’m dealing with lots of convolution all the time but the number of points per convolution is relatively small ( 0 - 500 ). Hence, the reason I decided that one convolution should be handled by one Streaming Multiprocessor alone. Does that makes sense?

Now, my implementation for the GPU looks quite different since I try to parallelize the calculation as much as possible. Let me just show the code before I go into more details.

[codebox]**global** void convolve_gpu_case_2( const float_t* src_1

```
, const float_t* src_2
, float_t* dst
, const unsigned int src_n
, const unsigned int dst_n
, const unsigned int threads_n
)
```

{

```
__shared__ float_t src_1_temp[ 0xc00 ];
__shared__ float_t src_2_temp[ 0xc00 ];
__shared__ float_t dst_temp [ 0x1800 ];
```

int tid = threadIdx.x + blockIdx.x * blockDim.x;

// 1. Copy sources to shared memory

unsigned int index = tid;

```
while( index < src_n )
{
src_1_temp[index] = src_1[index];
```

index += threads_n;

```
}
```

index = tid;

```
while( index < src_n )
{
src_2_temp[index] = src_2[index];
```

index += threads_n;

```
}
```

// Overwrite shared from previous values

```
index = tid;
while( index < dst_n )
{
dst_temp[index] = 0.f;
```

index += threads_n;

```
}
```

__syncthreads();

// 2. Convolve

index = tid;

```
while( index < dst_n )
{
unsigned int num_operations = ( index < src_n ) ? index + 1 : dst_n - index;
unsigned int index_src_1 = ( index < src_n ) ? 0 : index - src_n + 1;
unsigned int index_src_2 = ( index < src_n ) ? index : src_n - 1;
```

for( unsigned int j = 0; j < num_operations; ++j )

```
{
dst_temp[index] += src_1_temp[index_src_1] * src_2_temp[index_src_2];
```

index_src_1++;

```
index_src_2--;
}
```

index += threads_n;

```
}
```

__syncthreads();

// 3. Copy back the result to global memory

index = tid;

```
while( index < dst_n )
{
dst[index] = dst_temp[index];
```

index += threads_n;

```
}
```

}[/codebox]

The first step in my implementation is to fill up the shared memory with the two input arrays. I also have to reset to 0 the output shared memory. When I don’t do that then there are already values which mess up my results.

The next step is a little harder to describe. As you can see in the CPU implementation there are two nested for loops. Inside the inner for loop we are repeatedly adding a value to one of the elements in the output array. This introduces a data dependence problem since there might be threads writing and reading the same element at same time. To avoid such data dependence one could create a temporary square matrix whereas each row is fill by each inner for loop without the addition. Once that is done the output array can be filled by taking the matrix and diagonally added up values. For instance.

output[0] = mat[0,0]

output[1] = mat[1,0] + mat[0,1]

output[2] = mat[2,0] + mat[1,1] + mat[0,2]

output[3] = mat[3,0] + mat[2,1] + mat[1,2] + mat[0,3]

etc.

This basically sums up my first attempt to optimize my convolution algorithm. The next iteration goes a little further.

Here, I don’t allocate the matrix anymore but rather each thread represents a diagonal line through the matrix. For instance:

thread 0: output[0] = src1[0] * src2[0]

thread 1: output[1] = src1[1] * src2[0] + src1[0] * src2[1]

thread 2: output[2] = src1[2] * src2[0] + src1[1] * src2[1] + src1[0] * src2[2]

thread 3: output[3] = src1[3] * src2[0] + src1[2] * src2[1] + src1[1] * src2[2] + src1[0] * src2[3];

I hope I could describe my solution sufficiently. Just ask me if you couldn’t follow.

I’m interested to see if you might have an even better solution to my problem. Any suggestions are welcome.

I’m using a Tesla C2050 card.

Regards,

Christian