Vectorization intrinsics are not giving expected performance

Hi All,
I want to know the impact of vectorization on the Jetson TK1 board, so I have written a program that adds two arrays.
The code is given below.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define NEON   // comment this line out to build the scalar version

#ifdef NEON
#include <arm_neon.h>
#endif

#include <omp.h>

int main()
{
    int N = 1024 * 1024 * 10;
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);

#ifdef NEON
    printf("neon enabled\n");
#else
    printf("neon not enabled\n");
#endif

    // initialise the two input arrays
    for (int i = 0; i < N; i++)
    {
        b[i] = i;
        c[i] = i;
    }

    struct timeval start, end;

#ifdef NEON
    float *src1 = &b[0];
    float *src2 = &c[0];
    float *dst  = &a[0];
    float32x4_t src1_reg;
    float32x4_t src2_reg;
    float32x4_t dst_reg;

    gettimeofday(&start, NULL);
    // add four floats per iteration with NEON intrinsics
    for (int index = 0; index < N; index += 4)
    {
        src1_reg = vld1q_f32(src1);
        src2_reg = vld1q_f32(src2);
        dst_reg  = vaddq_f32(src1_reg, src2_reg);
        vst1q_f32(dst, dst_reg);
        src1 = src1 + 4;
        src2 = src2 + 4;
        dst  = dst + 4;
    }
    gettimeofday(&end, NULL);
    printf("neon intrinsics time is = %ld\n",
           (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec));
#else
    gettimeofday(&start, NULL);
    // plain scalar loop
    for (int i = 0; i < N; i++)
    {
        a[i] = b[i] + c[i];
    }
    gettimeofday(&end, NULL);
    printf("CPU time is = %ld\n",
           (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec));
#endif

    // sum the result so the compiler cannot optimise the work away
    float temp = 0;
    for (int i = 0; i < N; i++)
    {
        temp = temp + a[i];
    }
    printf("a = %f\n", temp);

    free(a);
    free(b);
    free(c);
}

I compiled it with g++ -std=c++0x -O3 -mfpu=neon -ftree-vectorizer-verbose=2 -fopenmp -fpermissive -fno-trapping-math test.cpp -o test (vectorization enabled) when NEON is defined (the #define NEON line near the top of the code).

I compiled it with g++ -std=c++0x -O3 -fno-tree-vectorize -fopenmp -fpermissive -fno-trapping-math test.cpp -o test (vectorization disabled) when that #define NEON line is commented out.

In the first case the execution time is ~50 ms; the second case takes ~55 ms.
I am wondering why there is so little speedup (less than 10%) when vectorization is enabled. Can someone tell me what the problem could be?

Thanks
sivaramakrishna

Vectorization is enabled by default with -O3, and although one case has -fno-tree-vectorize, I wonder if -O3 is vectorizing anyway. What happens if you also add -ftree-vectorizer-verbose=2 to the -fno-tree-vectorize build? Assuming -ftree-vectorizer-verbose=2 just displays vectorization information during compilation and does not itself enable vectorization, you should be able to confirm whether -fno-tree-vectorize under -O3 is actually working as expected (or whether -O3 overrides it and vectorizes anyway).

I also see an extended list of test cases which might provide more information:
https://gcc.gnu.org/projects/tree-ssa/vectorization.html#vectorizab

Hi,
I agree that vectorization is enabled by default at -O3. But when I use -fno-tree-vectorize and -ftree-vectorizer-verbose=2 along with -O3, the compiler does not print any vectorization information, so I assume vectorization is disabled. If I remove -fno-tree-vectorize, I do see vectorization information while compiling.
How could -O3 override the -fno-tree-vectorize flag and enable vectorization? As I understand it, -fno-tree-vectorize is exactly the flag that disables just the vectorization pass from the set of optimizations -O3 turns on.
I want to know whether there is any mistake in the code or in the compile flags.
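One way to take the compile flags out of the comparison is to time both loops in the same binary, pinning the baseline down with GCC's optimize function attribute. The sketch below is illustrative only: the function names are made up, and it assumes the attribute honours no-tree-vectorize and that the file is built with the same -O3 -mfpu=neon flags.

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <arm_neon.h>

// Microseconds elapsed between two gettimeofday() samples.
static long elapsed_us(const timeval &s, const timeval &e)
{
    return (e.tv_sec - s.tv_sec) * 1000000L + (e.tv_usec - s.tv_usec);
}

// Ask GCC not to auto-vectorize this function even under -O3,
// so it stays a genuine scalar baseline.
__attribute__((optimize("no-tree-vectorize")))
void add_scalar(const float *b, const float *c, float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

// Explicit NEON version: four floats per iteration.
void add_neon(const float *b, const float *c, float *a, int n)
{
    for (int i = 0; i < n; i += 4)
        vst1q_f32(a + i, vaddq_f32(vld1q_f32(b + i), vld1q_f32(c + i)));
}

int main()
{
    const int N = 1024 * 1024 * 10;
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = i; }

    timeval s, e;
    gettimeofday(&s, NULL); add_scalar(b, c, a, N); gettimeofday(&e, NULL);
    printf("scalar loop: %ld us\n", elapsed_us(s, e));

    gettimeofday(&s, NULL); add_neon(b, c, a, N); gettimeofday(&e, NULL);
    printf("neon loop:   %ld us\n", elapsed_us(s, e));

    free(a); free(b); free(c);
}

With -ftree-vectorizer-verbose=2 the compiler output should then also show whether it attempted to vectorize add_scalar, independent of what it does to the rest of the file.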

50 ms is a short amount of time. Try running it much longer and see if you get a different result.

Also, have you maximised all (CPU, EMC, GPU) clocks before running any benchmarks?

http://elinux.org/Jetson/Performance#Maximizing_CPU_performance
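Picking up the "run it much longer" suggestion, a simple pattern is to repeat the timed region many times and report the average per pass. The REPS value below is arbitrary and only there to stretch the measurement; this is a sketch, not the original benchmark.

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

int main()
{
    const int N = 1024 * 1024 * 10;
    const int REPS = 100;   // arbitrary repeat count, just to lengthen the run
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = i; }

    timeval s, e;
    gettimeofday(&s, NULL);
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];
    gettimeofday(&e, NULL);

    long total_us = (e.tv_sec - s.tv_sec) * 1000000L + (e.tv_usec - s.tv_usec);
    printf("average per pass: %ld us over %d passes\n", total_us / REPS, REPS);

    // Touch the result so the compiler cannot drop the work entirely.
    float sum = 0;
    for (int i = 0; i < N; i++) sum += a[i];
    printf("checksum = %f\n", sum);

    free(a); free(b); free(c);
}

Over several seconds of run time, clock ramping and timer granularity matter much less than they do over 50 ms.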

Hi all,

It seems that if I do just one addition per memory operation, vectorization does not help much. When I do multiple repeated addition operations on registers, I can see the impact of vectorization.
So is it true that for memory-bound applications vectorization does not help much?

Thanks
Siva Rama Krishna

Using multiple simultaneous operations requires knowing ahead of time that a predictable number of operations apply, without each operation depending on what the prior operation did. If you don't have multiple independent operations, then vectorization on something like SIMD or NEON won't be of benefit. SIMD and NEON have limits on how many operations can occur at once, but in general, increasing the number of operations up to that limit should help performance. There is always some overhead in moving data in and out of the NEON unit, so vectorizing tiny loops may show no benefit. Maximum benefit occurs when the operation count approaches your NEON or SIMD multiple-instruction limits.
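To make the operation-count point concrete, here is a sketch of the kind of compute-heavy kernel where the NEON path pulls ahead: each 16-byte load is reused many times before anything is stored back. The inner repeat count K is an arbitrary illustrative value, not something from the thread.

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <arm_neon.h>

int main()
{
    const int N = 1024 * 1024 * 10;
    const int K = 32;   // arithmetic operations per element loaded; arbitrary illustrative value
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = i; }

    timeval s, e;
    gettimeofday(&s, NULL);
    for (int i = 0; i < N; i += 4)
    {
        float32x4_t vb  = vld1q_f32(b + i);
        float32x4_t vc  = vld1q_f32(c + i);
        float32x4_t acc = vdupq_n_f32(0.0f);
        // Reuse the loaded registers K times so the loop becomes compute-bound
        // instead of being limited by DRAM bandwidth.
        for (int k = 0; k < K; k++)
            acc = vaddq_f32(acc, vaddq_f32(vb, vc));
        vst1q_f32(a + i, acc);
    }
    gettimeofday(&e, NULL);
    printf("compute-heavy NEON loop: %ld us\n",
           (e.tv_sec - s.tv_sec) * 1000000L + (e.tv_usec - s.tv_usec));

    // checksum so the work cannot be optimised away
    float sum = 0;
    for (int i = 0; i < N; i++) sum += a[i];
    printf("checksum = %f\n", sum);

    free(a); free(b); free(c);
}

For reference, the original add kernel moves roughly 3 arrays x 4 bytes x 10 x 2^20 elements, about 126 MB, in ~50 ms, i.e. around 2.5 GB/s, which suggests it is limited by DRAM bandwidth rather than by how fast the CPU can issue additions; raising the work done per byte loaded, as above, is what lets the wider NEON issue rate show.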