Hello, I'm a newbie in CUDA programming. This is a program for a finite impulse response (FIR) filter for signal processing. It is really just a difference equation, and the filter has 256 coefficients. I have about 15 million input samples, x[n]. The output signal is y[n], and the coefficients of the difference equation are W. The pseudocode for the equation is something like this:
y[n] = x[n] * W[0] + x[n-1] * W[1] + ... + x[n-255] * W[255]
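To make the equation concrete, here is a minimal CPU reference sketch in plain C that could be used to check the GPU output. The function name fir_reference and its parameters are made up for illustration and are not part of my kernel. It also assumes that samples before x[0] are zero, which the equation itself does not say.

```c
#include <stddef.h>

/* CPU reference for y[n] = x[n]*w[0] + x[n-1]*w[1] + ... + x[n-(taps-1)]*w[taps-1].
   Assumption (not from the original equation): samples before x[0] count as zero,
   so the inner loop stops once n - k would go negative. */
void fir_reference(const float *x, const float *w, float *y,
                   size_t n_samples, size_t taps)
{
    for (size_t n = 0; n < n_samples; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= n; k++) {
            acc += x[n - k] * w[k];
        }
        y[n] = acc;
    }
}
```

For my case this would be called with taps = 256 and n_samples of about 15 million; comparing its output with the kernel's output (within a small floating-point tolerance) should show whether the GPU version is correct.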
I have 256 threads per block and about 60,000 blocks for roughly 15e6 samples. The Xs variable holds the X input signal of the filter and Ys holds the filter output. Xs has 512 elements: 256 plus an additional 256 "recursive" elements. W is stored in constant memory (the constant cache), since it holds the coefficients of the difference equation.

I put a loop around the kernel and measured about 50 GFLOPS. When I change the 256 in the main loop to 16 and put another loop above it, the performance is about 90 GFLOPS. That is still poor for a 9800 GTX+. I think the situation is this: each block has 256 threads, all of those threads read from the same constant cache location, and j is incremented only after all threads in the block have done their work. Can you please help me improve this code? Thank you.
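__shared__ float Xs[512]; // input samples: 256 + 256 additional elements
__shared__ float Ys[256]; // output samples (size assumed: one per thread)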
for(int j=0; j<256; j++) // main loop