Generating SSE code for blocks.

Hi,

I am trying to generate good SSE code with the PGI compiler, but it refuses to generate SSE code for a block such as this:



  for (t4 = 0; t4 <= 14; t4++)
      {
        #pragma ivdep
//	#pragma vector aligned
        z[0] = z[0] + A[t4 * 15 + 0] * x[t4];
        z[0 + 1] = z[0 + 1] + A[t4 * 15 + 0 + 1] * x[t4];
        z[0 + 2] = z[0 + 2] + A[t4 * 15 + 0 + 2] * x[t4];
        z[0 + 3] = z[0 + 3] + A[t4 * 15 + 0 + 3] * x[t4];
        z[0 + 4] = z[0 + 4] + A[t4 * 15 + 0 + 4] * x[t4];
        z[0 + 5] = z[0 + 5] + A[t4 * 15 + 0 + 5] * x[t4];
        z[0 + 6] = z[0 + 6] + A[t4 * 15 + 0 + 6] * x[t4];
        z[0 + 7] = z[0 + 7] + A[t4 * 15 + 0 + 7] * x[t4];
        z[0 + 8] = z[0 + 8] + A[t4 * 15 + 0 + 8] * x[t4];
        z[0 + 9] = z[0 + 9] + A[t4 * 15 + 0 + 9] * x[t4];
        z[0 + 10] = z[0 + 10] + A[t4 * 15 + 0 + 10] * x[t4];
        z[0 + 11] = z[0 + 11] + A[t4 * 15 + 0 + 11] * x[t4];
        z[0 + 12] = z[0 + 12] + A[t4 * 15 + 0 + 12] * x[t4];
        z[0 + 13] = z[0 + 13] + A[t4 * 15 + 0 + 13] * x[t4];
        z[0 + 14] = z[0 + 14] + A[t4 * 15 + 0 + 14] * x[t4];
      }

I can re-roll the entire block into a loop. When I do this, the compiler unrolls the loop and vectorizes it, but it uses only two SSE registers, which restricts the instruction-level parallelism. Is there a way to get around this? The block contains a lot of independent instructions that are perfect for SSE.
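A minimal sketch of the re-rolled form described above. The function name `accumulate_rerolled`, the `DIM` macro, the `double` element type, and the `restrict` qualifiers are assumptions and are not taken from the original code. `restrict` tells the compiler that `z`, `A`, and `x` do not overlap; possible aliasing is one common reason a vectorizer reports a dependency on `z`.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 15  /* assumed: A is a DIM x DIM row-major matrix, as in the unrolled block */

/* Computes z[j] += A[t*DIM + j] * x[t] for every t and j.
 * Assumption: z, A and x do not overlap (expressed with restrict). */
void accumulate_rerolled(double *restrict z,
                         const double *restrict A,
                         const double *restrict x)
{
    assert(z != NULL && A != NULL && x != NULL);

    for (size_t t = 0; t < DIM; t++) {
        const double xt = x[t];           /* invariant across the inner loop */
        for (size_t j = 0; j < DIM; j++)  /* iterations over j are independent */
            z[j] += A[t * DIM + j] * xt;
    }
}
```

Within one value of `t`, the 15 updates touch 15 different elements of `z`, so the inner loop is the natural candidate for vectorization. Whether a given compiler then spreads the work over more than two vector registers depends on its unrolling heuristics.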

Thanks,
Shreyas

Hi Shreyas,

The above code won't vectorize due to the data dependency on z. I'm assuming your re-rolled version contains the same dependency? An example would be helpful.

  • Mat

Hi, I ran into this problem too with the PGI compiler (I use the latest Community Edition), but not with the Intel compiler.

Here is the code. Is there really a true dependency when the loop runs as pure SIMD with a short vector length? The original loop is not easy to parallelize without considerable effort. The commented-out loops are there only to show the steps of the transformation.
In loop2 I, unlike the compiler, see no obstacle to applying the function V/(2*V+1) with a vector length of 2.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N   10000

int main( int argc, char* argv[] )
{
   int     i, j;

   double  *a ;

   a = (double *) malloc( N * sizeof( double ) ) ;
   if ( a == NULL )
      { perror( "Memory allocation failed" ) ; return EXIT_FAILURE ; }

   a[0] = 2.0 ;

// loop1
/*   for( i = 1 ; i < N ; ++i )
   {
      a[i] = a[i-1] / ( a[i-1] + 1.0 ) ;
   }
*/

   a[1] = a[0] / ( a[0] + 1.0 ) ;

//loop2
/*
   #pragma nounroll
   for( i = 2 ; i < N-1 ; i+=2 )
   {
      a[i] = a[i-2] / ( 2.0 * a[i-2] + 1.0 ) ;
      a[i+1] = a[i-1] / ( 2.0 * a[i-1] + 1.0 ) ;
   }
*/

//loop3
   for( i = 2 ; i < N-1 ; i+=2 )
   {
      for( j = 0 ; j < 2; ++j )
      {
         a[i+j] = a[i+j-2] / ( 2.0 * a[i+j-2] + 1.0 ) ;
      }
   }

   printf( "Some values: %lg, %lg\n", a[5000], a[5001] ) ;

   return 0;
}
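The step from loop1 to loop2/loop3 relies on composing the recurrence with itself: with f(v) = v/(v+1), f(f(v)) = v/(2v+1), so each element can be computed from the one two positions back. A minimal sketch that checks this numerically is below; the helper names `fill_sequential`, `fill_stride2`, and `max_rel_diff` are hypothetical and not part of the posted program.

```c
#include <assert.h>
#include <stddef.h>

/* loop1: a[i] = a[i-1] / (a[i-1] + 1), strictly sequential. */
void fill_sequential(double *a, size_t n)
{
    assert(a != NULL && n >= 1);
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] / (a[i - 1] + 1.0);
}

/* loop3: a[i+j] depends only on a[i+j-2], so the two inner
 * iterations (j = 0, 1) are independent of each other.
 * Assumes n is even and at least 2, as in the posted code. */
void fill_stride2(double *a, size_t n)
{
    assert(a != NULL && n >= 2 && n % 2 == 0);
    a[1] = a[0] / (a[0] + 1.0);
    for (size_t i = 2; i + 1 < n; i += 2)
        for (size_t j = 0; j < 2; ++j)
            a[i + j] = a[i + j - 2] / (2.0 * a[i + j - 2] + 1.0);
}

/* Largest relative difference between p and q (p assumed nonzero). */
double max_rel_diff(const double *p, const double *q, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = p[i] - q[i];
        double m = p[i];
        if (d < 0.0) d = -d;
        if (m < 0.0) m = -m;
        if (d / m > worst) worst = d / m;
    }
    return worst;
}
```

Filling two arrays that both start with a[0] = 2.0, one with each function, and comparing them with `max_rel_diff` should show agreement up to rounding. The dependency distance in loop3 is 2, which is why a vector length of 2 is safe for the inner loop while longer vectors would need a further transformation.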

Here is the output from PGI:

pgcc -Manno -Minfo=all -O2 test.c -o test_pgi.exe
main:
40, Loop not vectorized: data dependency
Loop unrolled 4 times
Generated 2 prefetches in scalar loop
43, Loop unrolled 2 times (completely unrolled)

Here is the output from Intel:

icl /FAs /O2 /Qopt-report:1 /Qopt-report-phase:vec,loop test.c -o test_intl.exe
Begin optimization report for: main(int, char **)

Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at D:\Alexander\Documents\test\test.c(40,4)
remark #25439: unrolled with remainder by 2
remark #25456: Number of Array Refs Scalar Replaced In Loop: 4

LOOP BEGIN at D:\Alexander\Documents\test\test.c(43,7)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END

LOOP BEGIN at D:\Alexander\Documents\test\test.c(40,4)

LOOP END

The question: when will the PGI compiler be able to run such a loop with SIMD instructions*?
*Update: I meant vector SIMD, not scalar SIMD.