# Introducing CUDA to a convolution function

Dear all,
I am trying to introduce CUDA into a convolution function I have.
The following is the definition of the convolution function:

```c
//                    LH=61  LHHALF=30
//   convolve_cwp_1(    LH, -LHHALF,        h,      n,       0,        x,      n,       0,        y);
void convolve_cwp_1(int lx, int ifx, float *x, int ly, int ify, float *y, int lz, int ifz, float *z)
{
    // ilx = 30       ily = n-1     ilz = n-1
    int ilx=ifx+lx-1, ily=ify+ly-1, ilz=ifz+lz-1, i, j, jlow, jhigh;
    float sum;

    x -= ifx;  y -= ify;  z -= ifz;
    // i = 0 to n-1
    for (i=ifz; i<=ilz; ++i) {
        // ily = n-1
        jlow  = i-ily;  if (jlow<ifx)  jlow  = ifx;
        jhigh = i-ify;  if (jhigh>ilx) jhigh = ilx;
        for (j=jlow,sum=0.0; j<=jhigh; ++j)
            sum += x[j]*y[i-j];
        z[i] = sum;
    }
}
```

I have converted this function to:

```cuda
// x -= ifx;  y -= ify;  z -= ifz;  -- this will be taken care of in the function call
__global__ void cuda_convolution(int lx, int ifx, float *x, int ly, int ify, float *y, int lz, int ifz, float *z, int ilx, int ily, int ilz)
{
    int j, jlow, jhigh;
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i <= ilz)
    {
        jlow  = i-ily;  if (jlow<ifx)  jlow  = ifx;
        jhigh = i-ify;  if (jhigh>ilx) jhigh = ilx;
        for (j=jlow,sum=0.0; j<=jhigh; ++j)
            sum += x[j]*y[i-j];
        z[i] = sum;
    }
}
```
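For completeness, here is one way the host side could drive the kernel above (spelled `cuda_convolution` here). This is only a sketch under my own assumptions: the wrapper name `run_convolution`, the device pointer names, and the block size of 256 are not from the original code, and the grid-size calculation assumes `ifz == 0`, as in the call shown in the first post (the kernel's `if (i <= ilz)` guard also assumes that):

```cuda
// Hypothetical host-side driver for the kernel above.
void run_convolution(const float *h_x, const float *h_y, float *h_z,
                     int lx, int ifx, int ly, int ify, int lz, int ifz)
{
    int ilx = ifx + lx - 1, ily = ify + ly - 1, ilz = ifz + lz - 1;
    float *d_x, *d_y, *d_z;

    cudaMalloc(&d_x, lx * sizeof(float));
    cudaMalloc(&d_y, ly * sizeof(float));
    cudaMalloc(&d_z, lz * sizeof(float));
    cudaMemcpy(d_x, h_x, lx * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, ly * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per output sample i = 0..ilz (assumes ifz == 0).
    // Offsetting the device pointers here replaces the CPU-side
    // "x -= ifx;  y -= ify;  z -= ifz;" from the original function.
    int threads = 256;
    int blocks  = (ilz + 1 + threads - 1) / threads;
    cuda_convolution<<<blocks, threads>>>(lx, ifx, d_x - ifx,
                                          ly, ify, d_y - ify,
                                          lz, ifz, d_z - ifz,
                                          ilx, ily, ilz);

    cudaMemcpy(h_z, d_z, lz * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
}
```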

I am new to CUDA. Please suggest corrections and any extra optimizations I can do.

Use the code tag (last icon on top of the edit box) to format your code.

I found this function for convolution. Maybe it will be useful to you:

```cuda
#define THREADS_PER_BLOCK 128

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLOCK+2]; // shared across block

    int index = blockIdx.x * blockDim.x + threadIdx.x; // thread-local variable

    // cooperatively load this block's elements (plus a two-element halo) into shared memory
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLOCK + threadIdx.x] = input[index + THREADS_PER_BLOCK];
    }
    __syncthreads();

    float result = 0.0f; // thread-local variable
    for (int i=0; i<3; i++)
        result += support[threadIdx.x + i];

    output[index] = result / 3.f;
}

// host code //////////////////////////////////////////////
int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N + 2)); // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N); // allocate array in device memory
// properly initialize contents of devInput here...
convolve<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(N, devInput, devOutput);
```

You really want to declare your x and y arrays as `const float* __restrict__` so that reads can go through the read-only (texture) cache.
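In the kernel signature that would look like this (a sketch, keeping the parameter list from your kernel):

```cuda
__global__ void cuda_convolution(int lx, int ifx, const float* __restrict__ x,
                                 int ly, int ify, const float* __restrict__ y,
                                 int lz, int ifz, float* __restrict__ z,
                                 int ilx, int ily, int ilz);
```

The `const` plus `__restrict__` qualifiers tell the compiler the inputs are read-only and non-aliased, which lets it route those loads through the read-only data cache.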

Maybe partial unrolling of the loop `for (j=jlow,sum=0.0; j<=jhigh; ++j)` would be beneficial.

Try using `#pragma unroll 4`.
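The pragma goes directly above the loop it should unroll, e.g. (a sketch of the inner loop from your kernel):

```cuda
    float sum = 0.0f;
    #pragma unroll 4                    // hint: partially unroll by a factor of 4
    for (int j = jlow; j <= jhigh; ++j)
        sum += x[j] * y[i - j];
```

Even with a runtime trip count the compiler can still emit a 4x-unrolled body plus a remainder loop.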

Your CUDA code does not declare the variable sum as float explicitly - if you’re unlucky you end up with default int type. Am I missing something?

What are the typical block and grid sizes that this kernel is called with?

@cbuchner, don't worry, I have declared sum as float.
I will try the #pragma unroll option. Thanks for your help.