CUDA vs Intel IPP in signal processing

Hi everybody. I’m trying to use CUDA to filter a digital signal. Here is the formula (a standard FIR filter):

y[n] = sum( b[i] * x[n-i] ), for i = 0 … N-1

where

x[n] is the input signal,

y[n] is the output signal, and

b[i] are the filter coefficients.

The filter has 512 coefficients, and the input signal has 1048576 samples.

Here is the C++ code:

for (int i = n_inf - 1; i >= 0; i--)
{
	if (i < n_koef_ch) n_koef_ch = i + 1;
	if (n_koef_ch == 0) n_koef_ch = 1;
	outp[i] = 0;
	for (int j = 0; j < n_koef_ch; j++)
		outp[i] += koef[j] * inf[i - j];   // FIR accumulation
}

The CUDA code (all the coefficients are in constant memory, in the dc_koef array):

dim3 threads (n_koef);
dim3 blocks  (n_inf / threads.x);
mult<<<blocks, threads>>>(d_inf, d_outp, n_koef, n_inf);

__global__ void mult(float* d_inf, float* d_outp, int n_koef, int n_inf)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ float as[1024];

	// stage this block's input window into shared memory:
	// the current tile plus the preceding n_koef samples
	as[threadIdx.x] = (idx >= n_koef) ? d_inf[idx - n_koef] : 0.0f;
	as[threadIdx.x + n_koef] = d_inf[idx];
	__syncthreads();

	float sum = 0.0f;
	for (int i = 0; i < n_koef; i++)
		sum += dc_koef[i] * as[threadIdx.x + n_koef - i];

	if (idx < n_inf)
		d_outp[idx] = sum;
}

And using the IPP signal processing library:

__host__ void ippfilt(int n_inf, int n_koef, float* inf, float* koef, float* outp)
{
	IppStatus status;
	IppsFIRState_32f *fctx;
	ippsFIRInitAlloc_32f( &fctx, koef, n_koef, NULL );
	status = ippsFIR_32f(inf, outp, n_inf, fctx);
	ippsFIRFree_32f(fctx);
}


Hardware: Intel Core 2 Duo (E4500) 2.2 GHz; GeForce 9600GT Palit Sonic

The CUDA function runs 43 times faster than the C++ code, but about 6 times slower than the IPP library.

Please tell me, what am I doing wrong?

What’s the dc_koef?

I don’t know if the IPP library is taking advantage of both cores on your CPU, but it is almost certainly taking advantage of the SIMD instructions to get a big increase in speed.

Also, what are your actual speeds for computing the result with each method (in milliseconds, or whatever)?


profquail: my version of IPP is using only one core.

Actual times:

~4 s for the plain C++ code on the CPU;

40 ms for the GPU;

10 ms for IPP.

From personal experience, writing a CUDA kernel that outperforms IPP on small datasets is very, VERY difficult. In fact I’ve yet to really beat IPP in all cases with ANY of our CUDA kernels (except IPPI, where GPUs get a pretty massive performance advantage from the texture sampling & caching hardware).

If you’re dealing with datasets smaller than, say, 1-2 MB, or if your CPU code takes milliseconds or microseconds (not seconds), you’re seriously better off rolling your own heavily specialized SSE2 (and SSE3/4.x, if it applies to you) code on a single core - PCIe latency and CUDA kernel invocation overhead will kill your performance (in my experience) in such cases.

In many cases we’ve been able to get a 2-10x performance increase over IPP with our own CPU SIMD code (depending on the function).

There is something fishy about the timings. The job to be done is about 512 million madds, right? An E4500 @ 2.2 GHz using both cores and optimized for SSE can brute-force an FIR like that in perhaps 30 ms. If IPP does it in only 10 ms, then I would guess that there is an FFT involved somewhere.

Also, the single-threaded version ought to execute approximately 8x slower, but yours is more than 100x slower … Do you have a lot of denormals in your input data?

If the C++ compiler targets the x87 floating point unit, it WILL be much slower on Intel CPUs. I’ve done benchmarks for this: I had a non-parallelizable loop doing flops and compiled it with and without SSE. The SSE version did not parallelize computations - it went through single floats one by one (wasting 3/4 of the vector) - and yet it was, IIRC, over 15x faster than the non-SSE version. I inspected the assembly to check there weren’t any shenanigans, but found none.

I’ve tested this on around 12 CPUs, AMDs and Intels. AMDs don’t suffer from this (practically no difference between SSE and non-SSE on non-parallelizable flop loops). Every Intel I’ve tested (from Pentium 4 to 45nm dual-cores) had a major performance issue with the x87 FPU. Perhaps this benchmark of mine had denormals.

From then on I always check the “use SSE2 extensions” option in the compiler.

Now I can see how this 15x, times 2x for cores, times 4x for vector width, can potentially yield stupid amounts of speedup. The best I’ve achieved in a real application was 45x: OpenMP on two cores plus SSE2 enabled in the compiler, versus naive looping, gave 200 ms vs. 9 seconds (CUDA took 13 ms, by the way). This app didn’t have denormals.

I couldn’t help taking a closer look at the problem, letting ICC set a baseline for what an expected level of performance could look like. Except that the order of the coefficients has been reversed (so that both arrays are traversed in the same direction) and the dataset is 10 times bigger (to get some accuracy in the time measurements), this variant should be equivalent to the original source.

On an E1200 @ 1.2 GHz - with SSE and both cores running - I clock in at from 3.6 seconds down to 2.8 seconds for the full dataset (worth 10 Gflop), depending on how close to realtime I am aiming (128 - 1024 samples of latency), which translates to 360 down to 280 ms compared to the timings mentioned in previous posts. Although the coefficients, together with the current chunk of input/output data, fit nicely in the L1 cache, this is still 5x more than what one would have expected for double precision by blindly staring at peak arithmetic performance alone:

/** brute_fir.c
 *
 *          P
 * y[n] =  sum  b[i] * x[n+i-P]
 *         i=0
 *
 * icc -parallel -O2 -msse3 -march=core2 -o brute_fir_icc brute_fir.c
 */

#include <sched.h>   /* sched_setscheduler */
#include <unistd.h>  /* usleep */

typedef double fp_t;

#define P     (512)
#define BUFSZ (128) /* 1.3 ms @ 96KHz samplerate */

fp_t B[P];         // coefficients.
fp_t X[P+BUFSZ];   // input
fp_t Y[BUFSZ];     // output

void brute_P(fp_t *y, const fp_t *x, const fp_t *b)
{
	int i, n;
	for(n = 0; n < BUFSZ; n++)
	{
		y[n] = 0;
		for(i = 0; i < P; i++)
			y[n] += b[i] * x[n+i];
	}
}

int main(void)
{
	/* I need to use both stick and carrot to get
	 * this realtime kernel to understand that it
	 * should pay at least some attention to the
	 * application. This may or may not apply to
	 * your setup.
	 */
	static struct sched_param schp;
	schp.sched_priority = 80; // = 40;
	sched_setscheduler(0, SCHED_FIFO, &schp);
	usleep(10000); // yield for system drivers …

	int i;

	// set up coefficients (in reverse order)
	i = P;
	while(i--)
		X[i] = 0, B[i] = i*0.00001f;

	/* A 1988 CRAY 2 could not have completed this
	 * loop in less than 5 seconds. An E1200 from
	 * 2008 will beat that by at least 30%.
	 * Your saving: $25 million and 200 KW of power :-)
	 */
	i = 10 * (1048576/BUFSZ);
	while(i--)
	{
		int j;
		// keep the previous last P inputs
		for(j = 0; j < P; j++)
			X[j] = X[BUFSZ+j];
		// fetch new input
		for(j = 0; j < BUFSZ; j++)
			X[P+j] = i*0.00001f;
		usleep(4); // system time …
		brute_P( Y, X, B );
	}
	return 0;
}

Surprisingly, there is no advantage in switching to 32-bit floats … unless you cheat the permute unit and fake an easier access pattern where everything aligns almost perfectly, in which case the throughput is doubled.

The question now is: can CUDA shared memory and/or the texture cache give enough permute power to beat ICC? By how much? For comparison, an E1200 has approximately the same bandwidth, cache and peak performance as a single CUDA multiprocessor, so escape15’s example shows that it is at least possible to do just as well - relative to the available resources, that is.

It also depends on whether you generate code for a 64-bit platform. In 64-bit mode you get 16 XMM registers vs. 8 XMM registers in 32-bit mode, and there are other advantages to working in 64-bit too.

As for the x87 FPU - it is not wise to use it anymore when there is SIMD. You can do everything with SIMD that you were doing with the FPU. Intel itself recommends this.

That is right. It is my experience too.

There is one more way to speed up overall performance: when you launch a kernel on the GPU, control returns to the CPU while the GPU keeps running in parallel with it. If you manage to use the CPU’s SIMD units during this time to prepare the next batch of data for the GPU’s next computation, you can get the CPU and GPU working in parallel.