Hello, I'm a newbie in CUDA programming. This is a program for a finite impulse response (FIR) filter for signal processing. It is really just a difference equation, and the order of the filter will be 256. I have about 15 million samples, x[n]. The output signal is y[n], and the coefficients of the difference equation are W. The pseudocode for the equation is something like this:

y[n] = x[n] * W[0] + x[n-1] * W[1] + ... + x[n-255] * W[255]
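For checking the GPU output, a plain CPU version of this equation could look like the sketch below. It is only a reference under assumptions that are not in my code: the name `fir_reference` is made up, and samples before the start of the signal are treated as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum over k of x[n-k] * w[k], k = 0 .. taps-1.
// Assumption: samples before the start of the signal (n-k < 0) count as zero.
std::vector<float> fir_reference(const std::vector<float>& x,
                                 const std::vector<float>& w) {
    assert(!w.empty());
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        float acc = 0.0f;
        // Stop at k == n so that x[n - k] never indexes before x[0].
        for (std::size_t k = 0; k < w.size() && k <= n; ++k) {
            acc += x[n - k] * w[k];
        }
        y[n] = acc;
    }
    return y;
}
```

Calling `fir_reference(x, W)` with the 256 coefficients gives values that the kernel output can be compared against, within float rounding.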

I have 256 threads per block and about 60,000 blocks for roughly 15e6 samples. The Xs variable holds the filter's input signal X, and Ys holds the filter's output. Xs has 512 elements: 256 plus an additional 256 recursive elements. W is stored in constant memory, since it holds the coefficients of the difference equation. I put a loop around the kernel, and it reached about 50 GFLOPS. When I change 256 to 16 in the main loop and put another loop above it, the performance is about 90 GFLOPS. That is still poor for a 9800 GTX+. I think that within one block all 256 threads read from the same constant cache location, and j is incremented only after all threads in the block have done their work. Please, can you help me improve this code? Thank you.
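To make the GFLOPS figures concrete, here is a small sketch of the arithmetic. It assumes each tap counts as two floating-point operations (one multiply, one add) and exactly 15e6 samples; the function names are made up for illustration. Under those assumptions one pass is 2 * 15e6 * 256 = 7.68e9 operations, so 50 GFLOPS corresponds to roughly 0.15 s per pass and 90 GFLOPS to roughly 0.085 s.

```cpp
#include <cassert>

// Floating-point operations for one pass of the filter:
// each output sample needs `taps` multiplies and `taps` adds.
double fir_flop_count(double samples, double taps) {
    return 2.0 * samples * taps;
}

// Achieved rate in GFLOPS, given the elapsed time in seconds.
double gflops(double flop_count, double seconds) {
    assert(seconds > 0.0);
    return flop_count / seconds / 1e9;
}
```

For example, `gflops(fir_flop_count(15e6, 256), elapsed_seconds)` turns a measured kernel time into the rate quoted above, where `elapsed_seconds` is whatever the timer reports.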

__shared__ float Xs[512];

__shared__ float Ys[256];

int id=blockIdx.x*bdim+threadIdx.x;

int idx=threadIdx.x;

int idx2=idx+256;

Xs[idx2]=Xd[id+256];

Xs[idx]=Xd[id];

Ys[idx]=0.0f;

__syncthreads( );

for(int j=0; j<256; j++) // main loop

{

Ys[idx]+=Xs[idx2-j]*W[j]; // x[n-j] * W[j], walking back through the tile

}

__syncthreads( );

Yd[id]=Ys[idx];
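The indexing can be checked on the host with a sketch that mimics one thread block. It assumes the main loop is meant to read `Xs[idx2-j]`, as the difference equation implies; `emulate_block` and its parameters are hypothetical names, and `bdim` is the block size (256 here).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of one thread block (hypothetical helper).
// Xd must hold at least (block + 2) * bdim samples, W at least bdim coefficients.
std::vector<float> emulate_block(const std::vector<float>& Xd,
                                 const std::vector<float>& W,
                                 std::size_t block, std::size_t bdim) {
    assert(W.size() >= bdim);
    assert(Xd.size() >= (block + 2) * bdim);
    std::vector<float> Xs(2 * bdim), Ys(bdim, 0.0f);
    // Load phase: every "thread" idx copies two samples into the tile.
    for (std::size_t idx = 0; idx < bdim; ++idx) {
        std::size_t id = block * bdim + idx;
        Xs[idx] = Xd[id];
        Xs[idx + bdim] = Xd[id + bdim];
    }
    // Compute phase: thread idx walks backwards from Xs[idx + bdim].
    for (std::size_t idx = 0; idx < bdim; ++idx) {
        for (std::size_t j = 0; j < bdim; ++j) {
            Ys[idx] += Xs[idx + bdim - j] * W[j];
        }
    }
    return Ys;  // Ys[idx] is what the kernel would write to Yd[block * bdim + idx]
}
```

With this indexing, Yd[id] holds the filter output for input sample id+256, and Xd needs 256 extra samples past the range of the last block so that the `Xd[id+256]` load stays in bounds.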