CUDA vs Intel IPP in signal processing

Hi everybody. I’m trying to use CUDA to filter a digital signal. Here is the formula (a standard FIR filter):

y[n] = sum( b[i] * x[n-i] ), for i = 0 … N-1

where

x[n] is the input signal,

y[n] is the output signal, and

b[i] are the filter coefficients.

The filter has 512 coefficients, and the input signal has 1048576 samples.

Here is the C++ code:

for (int i = n_inf - 1; i >= 0; i--)
{
	if (i < n_koef_ch) n_koef_ch = i + 1;
	if (n_koef_ch == 0) n_koef_ch = 1;
	outp[i] = 0;
	for (int j = 0; j < n_koef_ch; j++)
		outp[i] += koef[j] * inf[i - j];   // FIR accumulation
}

The CUDA code (all the coefficients are in constant memory, in the dc_koef array):

dim3 threads (n_koef);
dim3 blocks  (n_inf / threads.x);
mult<<<blocks, threads>>>(d_inf, d_outp, n_koef, n_inf);

__global__ void mult(float* d_inf, float* d_outp, int n_koef, int n_inf)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ float as[1024];

	// stage this block's input window into shared memory:
	// the current tile plus the preceding n_koef samples
	as[threadIdx.x] = (idx >= n_koef) ? d_inf[idx - n_koef] : 0.0f;
	as[threadIdx.x + n_koef] = d_inf[idx];
	__syncthreads();

	float sum = 0.0f;
	for (int i = 0; i < n_koef; i++)
		sum += dc_koef[i] * as[threadIdx.x + n_koef - i];

	if (idx < n_inf)
		d_outp[idx] = sum;
}

And using the IPP signal processing library:

__host__ void ippfilt(int n_inf, int n_koef, float* inf, float* koef, float* outp)
{
	IppStatus status;
	IppsFIRState_32f *fctx;
	ippsFIRInitAlloc_32f( &fctx, koef, n_koef, NULL );
	status = ippsFIR_32f(inf, outp, n_inf, fctx);
	ippsFIRFree_32f(fctx);
}


Hardware: Intel Core 2 Duo (E4500) 2.2 GHz; GeForce 9600GT Palit Sonic

The CUDA function runs 43 times faster than the C++ code, but about 6 times slower than the IPP library.

Please tell me, what am I doing wrong?

What’s the dc_koef?

I don’t know if the IPP library is taking advantage of both cores on your CPU, but it is almost certainly taking advantage of the SIMD instructions to get a big increase in speed.

Also, what are your actual speeds for computing the result with each method (in milliseconds, or whatever)?


profquail: my version of IPP is using only one core.

Actual times:

~4 s for the plain C++ code on the CPU;

40 ms for the GPU;

10 ms for IPP.

From personal experience, writing a CUDA kernel that outperforms IPP on small datasets is very, VERY difficult. In fact I’ve yet to really beat IPP in all cases with ANY of our CUDA kernels (except IPPI, where GPUs get a pretty massive performance advantage from the texture sampling & caching hardware).

If you’re dealing with datasets smaller than, say, 1-2 MB, or if your CPU code takes milliseconds or microseconds (not seconds), you’re seriously better off rolling your own heavily specialized SSE2 (and SSE3/4.x, if it applies to you) code on a single core - PCIe latency and CUDA kernel invocation overhead will kill your performance (in my experience) in such cases.

In many cases we’ve been able to get a 2-10x performance increase over IPP with our own CPU SIMD code (depending on the function).

There is something fishy about the timings. The job to be done is about 512 million madds, right? An E4500 @ 2.2 GHz using both cores and optimized for SSE can brute-force an FIR like that in perhaps 30 ms. If IPP does it in only 10 ms, then I would guess that there is an FFT involved somewhere.

Also, the single-threaded version ought to execute approximately 8x slower, but yours is more than 100x slower … Do you have a lot of denormals in your input data?

If the C++ compiler targets the x87 floating point unit, it WILL be much slower on Intel CPUs. I’ve done benchmarks for this: I had a non-parallelizable loop doing flops and compiled it with and without SSE. The SSE version did not parallelize computations - it went through single floats one by one (wasting 3/4 of the vector) - and yet it was, IIRC, over 15x faster than the non-SSE version. I inspected the assembly to check there weren’t any shenanigans, but found none.

I’ve tested this on around 12 CPUs, AMDs and Intels. AMDs don’t suffer from this (practically no difference between SSE and non-SSE on non-parallelizable flop loops). Every Intel I’ve tested (from Pentium 4 to 45nm dual-cores) had a major performance issue with the x87 FPU. Perhaps this benchmark of mine had denormals.

From then on I always check the “use SSE2 extensions” option in the compiler.

Now I can see how this 15x, times 2x for cores, times 4x for vector width, can potentially yield stupid amounts of speedup. The best I’ve achieved in a real application was 45x: OpenMP on two cores plus SSE2 enabled in the compiler, versus naive looping, gave 200 ms vs. 9 seconds (CUDA took 13 ms, by the way). This app didn’t have denormals.

I couldn’t help taking a closer look at the problem, letting ICC set a baseline for what an expected level of performance could look like. Except that the order of the coefficients has been reversed (so that both arrays are traversed in the same direction) and the dataset is 10 times bigger (to get some accuracy in the time measurements), this variant should be equivalent to the original source.

On an E1200 @ 1.2 GHz - with SSE and both cores running - I clock in at from 3.6 seconds down to 2.8 seconds for the full dataset (worth 10 Gflop), depending on how close to realtime I am aiming (128 - 1024 samples of latency), which translates to 360 down to 280 ms compared to the timings mentioned in previous posts. Although the coefficients, together with the current chunk of input/output data, fit nicely in the L1 cache, this is still 5x more than what one would have expected for double precision by blindly staring at peak arithmetic performance alone:

/** brute_fir.c
 *
 *          P
 * y[n] =  sum  b[i] * x[n+i-P]
 *         i=0
 *
 * icc -parallel -O2 -msse3 -march=core2 -o brute_fir_icc brute_fir.c
 */

#include <sched.h>   /* sched_setscheduler */
#include <unistd.h>  /* usleep */

typedef double fp_t;

#define P     (512)
#define BUFSZ (128) /* 1.3 ms @ 96KHz samplerate */

fp_t B[P];         // coefficients.
fp_t X[P+BUFSZ];   // input
fp_t Y[BUFSZ];     // output

void brute_P(fp_t *y, const fp_t *x, const fp_t *b)
{
	int i, n;
	for(n = 0; n < BUFSZ; n++)
	{
		y[n] = 0;
		for(i = 0; i < P; i++)
			y[n] += b[i] * x[n+i];
	}
}

int main(void)
{
	/* I need to use both stick and carrot to get
	 * this realtime kernel to understand that it
	 * should pay at least some attention to the
	 * application. This may or may not apply to
	 * your setup.
	 */
	static struct sched_param schp;
	schp.sched_priority = 80; // = 40;
	sched_setscheduler(0, SCHED_FIFO, &schp);
	usleep(10000); // yield for system drivers …

	int i;

	// set up coefficients (in reverse order)
	i = P;
	while(i--)
		X[i] = 0, B[i] = i*0.00001f;

	/* A 1988 CRAY 2 could not have completed this
	 * loop in less than 5 seconds. An E1200 from
	 * 2008 will beat that by at least 30%.
	 * Your saving: $25 million and 200 KW of power :-)
	 */
	i = 10 * (1048576/BUFSZ);
	while(i--)
	{
		int j;
		// keep the previous last P inputs
		for(j = 0; j < P; j++)
			X[j] = X[BUFSZ+j];
		// fetch new input
		for(j = 0; j < BUFSZ; j++)
			X[P+j] = i*0.00001f;
		usleep(4); // system time …
		brute_P( Y, X, B );
	}
	return 0;
}

Surprisingly, there is no advantage in switching to 32-bit floats … unless you cheat the permute unit and fake an easier access pattern where everything aligns almost perfectly, in which case the throughput is doubled.

The question now is: can CUDA shared memory and/or the texture cache give enough permute power to beat ICC? By how much? For comparison, an E1200 has approximately the same bandwidth, cache and peak performance as a single CUDA multiprocessor, so escape15’s example shows that it is at least possible to do just as well - relative to the available resources, that is.

It also depends on whether you generate code for a 64-bit platform. In 64-bit mode you get 16 XMM registers vs. 8 XMM registers in 32-bit mode, and there are other advantages to working in 64-bit too.

As for the x87 FPU - it is not wise to use it anymore when there is SIMD. You can do everything with SIMD that you were doing with the FPU. Intel itself recommends this.

That is right. It is my experience too.

There is one more way to speed up overall performance: when you launch a kernel on the GPU, control returns to the CPU while the GPU keeps running in parallel with it. If you manage to use the CPU’s SIMD units during this time to prepare the next batch of data for the GPU’s next computation, you can get the CPU and GPU working in parallel.