Introducing CUDA to a convolution function

Dear all,
I am trying to introduce CUDA to a convolution function I have.
Following is the definition of the convolution function:

//   computes z[i] = sum of x[j]*y[i-j] over the overlapping index range
//                    H=61  LHHALF=30
//   convolve_cwp_1(    LH, -LHHALF,        h,      n,       0,        x,      n,       0,        y);
void convolve_cwp_1(int lx, int ifx, float *x, int ly, int ify, float *y, int lz, int ifz, float *z)
{
    //ilx= 30         ily=n-1       ilz=n-1
    int ilx=ifx+lx-1, ily=ify+ly-1, ilz=ifz+lz-1,i,j,jlow,jhigh;
    float sum;

    x -= ifx;  y -= ify;  z -= ifz;
    //i=0 to n-1
    for (i=ifz; i<=ilz; ++i) {
        //ily=n-1
        jlow = i-ily;  if (jlow<ifx) jlow = ifx;
        jhigh = i-ify;  if (jhigh>ilx) jhigh = ilx;
        for (j=jlow,sum=0.0; j<=jhigh; ++j)
            sum += x[j]*y[i-j];
        z[i] = sum;
    }
}

I have converted this function to the following kernel:

//x -= ifx;  y -= ify;  z -= ifz;  -- this will be taken care of in the function call
__global__ void cuda_convolution(int lx, int ifx, float *x, int ly, int ify, float *y, int lz, int ifz, float *z, int ilx, int ily, int ilz)
{
    int j, jlow, jhigh;
    int i = threadIdx.x + blockIdx.x*blockDim.x;  // one thread per output sample; assumes ifz == 0, as in the call above
    if (i <= ilz)
    {
        jlow  = i-ily;  if (jlow<ifx)  jlow  = ifx;
        jhigh = i-ify;  if (jhigh>ilx) jhigh = ilx;
        for (j=jlow,sum=0.0; j<=jhigh; ++j)
            sum += x[j]*y[i-j];
        z[i] = sum;
    }
}
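
For reference, this is roughly how I plan to call it from the host (a sketch; the block size of 256 is an arbitrary choice, and x, y, z here are already device pointers):

// Host-side wrapper sketch: applies the index offsets before launching the kernel.
void cuda_convolution_launch(int lx, int ifx, float *x, int ly, int ify, float *y, int lz, int ifz, float *z)
{
    int ilx = ifx + lx - 1, ily = ify + ly - 1, ilz = ifz + lz - 1;

    x -= ifx;  y -= ify;  z -= ifz;           // offsets handled here, not in the kernel

    int threads = 256;                        // illustrative block size
    int blocks  = (ilz + threads) / threads;  // enough threads to cover i = 0 .. ilz
    cuda_convolution<<<blocks, threads>>>(lx, ifx, x, ly, ify, y, lz, ifz, z, ilx, ily, ilz);
}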

I am new to CUDA. Please suggest whether this is correct and what extra optimizations I can do.

Use the code tag (the last icon at the top of the edit box) to format your code.

Please see the updated post.

I found this function for convolution. Maybe it will be useful to you:

#define THREADS_PER_BLOCK 128
__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLOCK + 2];   // shared across the block
    int index = blockIdx.x * blockDim.x + threadIdx.x; // thread-local variable
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        // the first two threads also load the halo elements past the block
        support[THREADS_PER_BLOCK + threadIdx.x] = input[index + THREADS_PER_BLOCK];
    }
    __syncthreads();
    float result = 0.0f;                               // thread-local variable
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
// host code //////////////////////////////////////////////
int N = 1024 * 1024;
float *devInput, *devOutput;
cudaMalloc(&devInput, sizeof(float) * (N + 2)); // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);      // allocate array in device memory
// properly initialize contents of devInput here...
convolve<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(N, devInput, devOutput);
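
One caveat: N/THREADS_PER_BLOCK only covers every element when N is a multiple of the block size (it is here, since 1024*1024 is a multiple of 128). For an arbitrary N you would round the grid up and guard the store, roughly:

int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;  // round up
convolve<<<blocks, THREADS_PER_BLOCK>>>(N, devInput, devOutput);
// and inside the kernel: if (index < N) output[index] = result / 3.f;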

You really want to declare your x and y pointers as const float* __restrict__ so the compiler can route those reads through the read-only (texture) cache.
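
For your kernel, only the signature needs to change, something like this (sketch):

// Same kernel, with read-only pointer qualifiers (only the signature differs).
__global__ void cuda_convolution(int lx, int ifx, const float* __restrict__ x,
                                 int ly, int ify, const float* __restrict__ y,
                                 int lz, int ifz, float* __restrict__ z,
                                 int ilx, int ily, int ilz)
{
    // body unchanged; because x and y are const and guaranteed not to alias z,
    // the compiler can issue read-only (texture-path) loads for them
}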

Maybe a partial unrolling of the loop for (j=jlow,sum=0.0; j<=jhigh; ++j) would be beneficial.

Try #pragma unroll 4.
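
Applied to the inner loop of your kernel, that would look like this (the factor 4 is just a starting point to tune):

#pragma unroll 4                 // ask the compiler to unroll 4 iterations at a time
for (j=jlow, sum=0.0; j<=jhigh; ++j)
    sum += x[j]*y[i-j];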

Your CUDA code does not declare the variable sum as float explicitly - if you’re unlucky you end up with default int type. Am I missing something?

What are the typical block and grid sizes that this kernel is called with?

@cbuchner, don't worry, I have declared sum as float.
I will try the #pragma unroll option. Thanks for your help.