the code runs only at 90 GFLOPS

Hello, I’m a newbie in CUDA programming. This is a program for a finite impulse response (FIR) filter for signal processing. It is actually a difference equation, and the order of the filter will be 256. I have about 15 million samples, x[n]. The output signal is y[n] and the coefficients of the difference equation are W. The pseudocode for the equation is something like this:
y[n] = x[n] * W[0] + x[n-1] * W[1] + … + x[n-255] * W[255]
I have 256 threads per block and about 60,000 blocks for approximately 15e6 samples. The Xs variable holds the X input signal of the filter and Ys the filter output. Xs has 512 elements, because there are 256 samples plus an additional 256 history elements. W is stored in the constant cache, since those are the coefficients of the difference equation. I put a loop around the kernel launch and got about 50 GFLOPS. When I changed the 256 in the main loop to 16 and put another loop above it, the performance is about 90 GFLOPS, but that is still poor on a 9800 GTX+. I think that in one block there are 256 threads, all of those threads read from the same constant cache location, and j is incremented only after all threads in the block have done their work. Please, can you help me improve this code? Thank you.

__shared__ float Xs[512];   // 256 samples + 256 history elements
__shared__ float Ys[256];
int id = blockIdx.x * blockDim.x + threadIdx.x;
int idx = threadIdx.x;
int idx2 = idx + 256;
__syncthreads();
for (int j = 0; j < 256; j++) // main loop
    Ys[idx] += Xs[idx2] * W[j];
__syncthreads();

One piece of advice would be to partially unroll your for loop.
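To illustrate, here is a host-side C sketch of unrolling that inner loop by a factor of 4. The function name, the factor 4, and the j-dependent Xs indexing (which a FIR needs anyway) are mine; Xs, W, and idx2 follow the kernel. In the actual kernel you can also just put `#pragma unroll 4` in front of the loop and let nvcc do the same transformation.

```c
#define TAPS 256

/* Host-side sketch of partially unrolling the inner FIR loop by 4.
   Plain C only to illustrate the transformation, not a CUDA kernel. */
float fir_unrolled(const float *Xs, const float *W, int idx2)
{
    float acc = 0.0f;
    for (int j = 0; j < TAPS; j += 4) {
        /* Four independent multiply-adds per iteration: fewer branch
           and index-update instructions per useful FLOP. */
        acc += Xs[idx2 - j]     * W[j];
        acc += Xs[idx2 - j - 1] * W[j + 1];
        acc += Xs[idx2 - j - 2] * W[j + 2];
        acc += Xs[idx2 - j - 3] * W[j + 3];
    }
    return acc;
}
```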

Where does W reside? Is it in global or texture memory?

W is in constant memory
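For completeness, a typical setup for coefficients in constant memory looks like this (a sketch; `h_W` is an assumed name for the host-side coefficient array):

```cuda
// Device-side declaration: W lives in constant memory, which is cached
// and broadcast when all threads in a warp read the same element.
__constant__ float W[256];

// Host-side upload before launching the kernel.
void upload_coefficients(const float *h_W)
{
    // Copy 256 floats from host memory into the constant-memory symbol W.
    cudaMemcpyToSymbol(W, h_W, 256 * sizeof(float));
}
```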

cbuchner1, thanks. When I used unrolling, performance went up 22%, to 110 GFLOPS.

I presume the reads from memory are bottlenecking you a bit. Frankly, there are few floating point ops being done and quite a lot of memory accesses - you have a single MAD for 3 memory accesses in this loop. Since it’s constant and shared memory, it should be fast but it will never be that fast. I’ve had similar code that read ~200 elements from cmem and also got around 95GFLOPS (and I had a few more arithmetic ops between reads) on an 8800 GTS 512 (basically the same thing as your 9800). 120 after unrolling.

I didn’t use shared memory there, so the common culprit seems to be constant memory. It’s still faster than if we tried global memory.

To get lots and lots of flops, you need algorithms that are arithmetically intensive enough to saturate the streaming processors and not rely on memory buses (this also goes for CPU algorithms, by the way). Which means that FLOPS aren’t a useful measurement of performance (surprise, surprise!) because they are entirely algorithm-dependent.

I say don’t look at flops (at all!), look at the speed-ups against multicore CPUs. Make sure to vectorize and multithread the CPU versions.

Another piece of advice I could give you is the following: look at the resulting PTX! (use the --keep compile option)

Look for inefficiencies, like for example needless computations of indices.

Sometimes the “volatile” trick helps to move some redundant computations out of a time critical inner loop. Declare some local variables volatile (index variables, etc) and see if that makes a difference in the PTX code.

Maybe having one index variable j is inefficient if you try to access [j], [j+1], [j+2], … [j+15] in your unrolled loop. Maybe accessing through pointers instead of accessing by means of an index variable would be more efficient (faster?).
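A possible pointer-walking version of the inner loop, shown as host-side C for clarity (the function name and the j-dependent Xs indexing are mine; the same rewrite applies inside the kernel):

```c
#define TAPS 256

/* Instead of recomputing Xs[idx2 - j] from an index each iteration,
   walk one pointer down through the samples while another walks up
   through the coefficients. */
float fir_pointer(const float *Xs, const float *W, int idx2)
{
    const float *x = Xs + idx2;   /* starts at the newest sample */
    const float *w = W;           /* starts at the first tap     */
    float acc = 0.0f;
    for (int j = 0; j < TAPS; j++)
        acc += *x-- * *w++;       /* one decrement/increment, no index math */
    return acc;
}
```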

Also try to profile this thing. See if the CUDA profiler reports uncoalesced reads or bank conflicts in shared memory.

is “+=Xs[idx2]*W[j];” ok?

If you’re unlucky, the compiler will compute idx+256 in the inner loop over and over (check the PTX).

A possible way to fix it would be to declare it as volatile int idx2 = idx+256;

As bog already pointed out, there’s something wrong with the code. The indices for both the Ys and Xs array within the for loop are independent of the iterator.

Also, when performing signal filtering with large filter kernels, I think you’re better off performing the convolution in the frequency domain using FFTs.
A good FFT implementation can perform the conversion to frequency domain or the inverse transform in O(NlogN) operations.
This has a complexity of:
O(512·9 (to frequency domain) + 512 (convolution) + 512·9 (to time domain)) = O(512·19) for one batch (since log2 512 = 9),
whereas in the time domain, the complexity is O(256·512) per batch.

You can use overlap-save or overlap-add method for performing the convolution in small batches.
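A minimal host-side C sketch of the overlap-add structure, with direct per-block convolution standing in for the FFT-domain product (the block length L and all names here are illustrative):

```c
/* Overlap-add sketch: the signal is processed in blocks of L samples,
   each block is convolved with the m-tap filter (here by direct
   convolution; in practice this step is done via FFTs), and the m-1
   tail samples of each block's result land in the region overlapped
   by the next block, where they are simply added in.
   y must have length n + m - 1 and be zero-initialized. */
void fir_overlap_add(const float *x, int n, const float *h, int m,
                     int L, float *y)
{
    for (int start = 0; start < n; start += L) {
        int len = (start + L <= n) ? L : n - start;
        /* convolve this block with h, accumulating into y so that
           overlapping tails from adjacent blocks sum correctly */
        for (int i = 0; i < len; i++)
            for (int j = 0; j < m; j++)
                y[start + i + j] += x[start + i] * h[j];
    }
}
```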


As someone else has already pointed out, the main loop should index Xs with the iterator, e.g.:

for(int j=0; j<256; j++) // main loop
    Ys[idx] += Xs[idx2 - j] * W[j];

However, I would like to share a thought about algorithm optimization.

You are letting each thread evaluate one result. But this way you do not parallelize the accesses to Xs, which is where you are reusing data: if you do not parallelize them you will not get much improvement, because the multiprocessors will execute independent operations sequentially, since that is how you have written it. That is not what you want on GPUs. Even if you unroll the loop, these are still sequential operations of a single thread.

I think you should let each thread work on a different tap of the filter, so all the threads can work on ADJACENT items in shared memory. Since every thread works on a different tap, just load that thread’s tap coefficient into a register (a local variable) before the loop. This way up to 256 threads can work together on the same task.

This way you encounter another problem: now you need to reduce the summation to evaluate Ys[idx], since different threads evaluate its addends, so you need synchronization. However this is light, since all the threads are synchronous. You can do it with a binary tree - see the reduction sample…c/reduction.pdf. However, as in the reduction sample, I think you will get the best results with a hybrid between your method and this one: each thread evaluates SOME elements, on taps kept in SOME registers, and then they are all summed together.
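A host-side C simulation may make that scheme concrete (function and variable names are mine; on the GPU each loop over t would be one parallel step, with a __syncthreads() between reduction levels):

```c
#define TAPS 256

/* Simulation of the tap-parallel scheme: "thread" t computes the
   addend for tap t, then a binary-tree reduction (as in the SDK
   reduction sample) sums the 256 partials into one output sample. */
float fir_tap_parallel(const float *Xs, const float *W, int idx2)
{
    float partial[TAPS];

    /* step 1: each thread multiplies its own tap */
    for (int t = 0; t < TAPS; t++)
        partial[t] = Xs[idx2 - t] * W[t];

    /* step 2: tree reduction, log2(256) = 8 levels */
    for (int stride = TAPS / 2; stride > 0; stride /= 2)
        for (int t = 0; t < stride; t++)
            partial[t] += partial[t + stride];

    return partial[0];
}
```

The summation order differs from the sequential loop, so the result matches only up to floating-point rounding.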

I think this way you have a better chance of getting closer to peak performance.