Generating SSE code for blocks.

alienanthill · July 11, 2011, 2:42pm

Hi,

I am trying to generate good SSE code with the PGI compiler, but it refuses to generate SSE code for a block such as this:

  for (t4 = 0; t4 <= 14; t4++)
      {
        #pragma ivdep
//	#pragma vector aligned
        z[0] = z[0] + A[t4 * 15 + 0] * x[t4];
        z[0 + 1] = z[0 + 1] + A[t4 * 15 + 0 + 1] * x[t4];
        z[0 + 2] = z[0 + 2] + A[t4 * 15 + 0 + 2] * x[t4];
        z[0 + 3] = z[0 + 3] + A[t4 * 15 + 0 + 3] * x[t4];
        z[0 + 4] = z[0 + 4] + A[t4 * 15 + 0 + 4] * x[t4];
        z[0 + 5] = z[0 + 5] + A[t4 * 15 + 0 + 5] * x[t4];
        z[0 + 6] = z[0 + 6] + A[t4 * 15 + 0 + 6] * x[t4];
        z[0 + 7] = z[0 + 7] + A[t4 * 15 + 0 + 7] * x[t4];
        z[0 + 8] = z[0 + 8] + A[t4 * 15 + 0 + 8] * x[t4];
        z[0 + 9] = z[0 + 9] + A[t4 * 15 + 0 + 9] * x[t4];
        z[0 + 10] = z[0 + 10] + A[t4 * 15 + 0 + 10] * x[t4];
        z[0 + 11] = z[0 + 11] + A[t4 * 15 + 0 + 11] * x[t4];
        z[0 + 12] = z[0 + 12] + A[t4 * 15 + 0 + 12] * x[t4];
        z[0 + 13] = z[0 + 13] + A[t4 * 15 + 0 + 13] * x[t4];
        z[0 + 14] = z[0 + 14] + A[t4 * 15 + 0 + 14] * x[t4];
      }

I can re-roll the entire block into a loop. When I do this, the compiler unrolls the loop and vectorizes it, but it uses only two SSE registers, which restricts the instruction-level parallelism. Is there a way to get around this? The block contains a lot of independent instructions that are perfect for SSE.

Thanks,
Shreyas

MatColgrove · July 11, 2011, 5:42pm

Hi Shreyas,

The above code won’t vectorize due to the data dependency on z. I’m assuming your re-rolled version contains the same dependency? An example would be helpful.

Mat
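
For reference, a re-rolled form of the block might look like the sketch below. It is not the original poster's code: the function name block_update, the DIM macro, the double element type, and the restrict qualifiers are all assumptions made for illustration. Within one t4 iteration the 15 updates touch different elements of z, so the inner loop carries no dependency across j as long as z, A and x do not overlap; restrict is one way to state that to the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 15   /* matches the 15-wide block in the post (assumed square) */

/* Re-rolled form of the hand-unrolled block:
 *    z[j] = z[j] + A[t4 * DIM + j] * x[t4]
 * restrict promises that z, A and x do not overlap. That is an assumption;
 * the original post does not show how the arrays are declared. */
void block_update(double *restrict z,
                  const double *restrict A,
                  const double *restrict x)
{
    assert(z != NULL && A != NULL && x != NULL);

    for (size_t t4 = 0; t4 < DIM; t4++) {
        const double xt = x[t4];            /* invariant in the inner loop */
        for (size_t j = 0; j < DIM; j++) {
            z[j] += A[t4 * DIM + j] * xt;   /* independent across j */
        }
    }
}
```

Whether a given compiler vectorizes the inner loop, and how many registers it uses, still depends on its version and flags, so the -Minfo output remains the place to check.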

alex_96 · May 21, 2017, 5:58pm

Hi, I ran into this problem too with the PGI compiler (I use the latest Community Edition), but not with the Intel compiler.

Here is the code. Is there really a true dependency that prevents this loop from running in pure SIMD with a shorter vector length? The loop is not easy to parallelize without considerable effort. The commented-out loops are there only to show the path I used to transform it.
In loop2 each element is the function V/(2*V+1) of the element two positions back, so a vector length of 2 should work; unlike the compiler, I do not see an obstacle there.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N   10000

int main( int argc, char* argv[] )
{
   int     i, j;

   double  *a ;

   a = (double *) malloc( N * sizeof( double ) ) ;
   if ( a == NULL )
      { perror( "Memory allocation failed" ) ; return EXIT_FAILURE ; }

   a[0] = 2.0 ;

// loop1
/*   for( i = 1 ; i < N ; ++i )
   {
      a[i] = a[i-1] / ( a[i-1] + 1.0 ) ;
   }
*/

   a[1] = a[0] / ( a[0] + 1.0 ) ;

//loop2
/*
   #pragma nounroll
   for( i = 2 ; i < N-1 ; i+=2 )
   {
      a[i] = a[i-2] / ( 2.0 * a[i-2] + 1.0 ) ;
      a[i+1] = a[i-1] / ( 2.0 * a[i-1] + 1.0 ) ;
   }
*/

//loop3
   for( i = 2 ; i < N-1 ; i+=2 )
   {
      for( j = 0 ; j < 2; ++j )
      {
         a[i+j] = a[i+j-2] / ( 2.0 * a[i+j-2] + 1.0 ) ;
      }
   }

   printf( "Some values: %lg, %lg\n", a[5000], a[5001] ) ;

   return 0;
}
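
A side note on the step from loop1 to loop2: substituting a[i-1] = a[i-2] / (a[i-2] + 1.0) into the loop1 update gives a[i] = a[i-2] / (2.0 * a[i-2] + 1.0), which is the V/(2*V+1) form. The sketch below checks this numerically. The function names fill_step1, fill_step2 and max_abs_diff are hypothetical, and the handling of an odd-length array is an addition that test.c does not need because N is even.

```c
#include <assert.h>
#include <stddef.h>

/* loop1 from the post: every element depends on its direct predecessor. */
void fill_step1(double *a, size_t n)
{
    assert(n >= 1);
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] / (a[i - 1] + 1.0);
}

/* loop2 from the post: every element depends on the one two positions
 * back, so a[i] and a[i+1] can be computed independently of each other. */
void fill_step2(double *a, size_t n)
{
    assert(n >= 2);
    a[1] = a[0] / (a[0] + 1.0);

    size_t i;
    for (i = 2; i + 1 < n; i += 2) {
        a[i]     = a[i - 2] / (2.0 * a[i - 2] + 1.0);
        a[i + 1] = a[i - 1] / (2.0 * a[i - 1] + 1.0);
    }
    if (i < n)   /* odd n: one element left over (not needed for N = 10000) */
        a[i] = a[i - 2] / (2.0 * a[i - 2] + 1.0);
}

/* Largest absolute difference between two arrays of length n. */
double max_abs_diff(const double *p, const double *q, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = p[i] - q[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

The two fills are equal mathematically but not bit for bit, since they round at different points, so any comparison should use a small tolerance rather than exact equality.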

Here is the PGI output for test.c:

pgcc -Manno -Minfo=all -O2 test.c -o test_pgi.exe
main:
40, Loop not vectorized: data dependency
Loop unrolled 4 times
Generated 2 prefetches in scalar loop
43, Loop unrolled 2 times (completely unrolled)

Here is the Intel output for test.c:

icl /FAs /O2 /Qopt-report:1 /Qopt-report-phase:vec,loop test.c -o test_intl.exe
Begin optimization report for: main(int, char **)

Report from: Loop nest & Vector optimizations [loop, vec]

LOOP BEGIN at D:\Alexander\Documents\test\test.c(40,4)
remark #25439: unrolled with remainder by 2
remark #25456: Number of Array Refs Scalar Replaced In Loop: 4

LOOP BEGIN at D:\Alexander\Documents\test\test.c(43,7)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END

LOOP BEGIN at D:\Alexander\Documents\test\test.c(40,4)

LOOP END

The question: when will the PGI compiler be able to run such a loop with SIMD instructions*?
*Update: I meant vector SIMD, not scalar SIMD.
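
To illustrate what "vector SIMD with length 2" means for loop3, here is a hand-written sketch that uses SSE2 intrinsics from <emmintrin.h>. It is not the code that either compiler generated. The function name fill_pairs and the HAVE_SSE2 macro are hypothetical, and the sketch assumes an even array length with a[0] and a[1] already set. Because each pair (a[i], a[i+1]) reads only the previous pair (a[i-2], a[i-1]), one 2-wide load, multiply, add, divide and store per outer iteration is enough.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical helper macro for this sketch */
#endif

/* loop3 from the post, with the two-iteration inner loop expressed as a
 * single 2-wide operation. Assumes n is even and a[0], a[1] are set. */
void fill_pairs(double *a, size_t n)
{
    assert(n >= 2 && n % 2 == 0);

#ifdef HAVE_SSE2
    const __m128d one = _mm_set1_pd(1.0);
    const __m128d two = _mm_set1_pd(2.0);

    for (size_t i = 2; i < n; i += 2) {
        __m128d v   = _mm_loadu_pd(&a[i - 2]);              /* a[i-2], a[i-1] */
        __m128d den = _mm_add_pd(_mm_mul_pd(two, v), one);  /* 2*v + 1        */
        _mm_storeu_pd(&a[i], _mm_div_pd(v, den));           /* a[i], a[i+1]   */
    }
#else
    /* Scalar fallback for targets without SSE2: same arithmetic per lane. */
    for (size_t i = 2; i < n; i += 2) {
        for (size_t j = 0; j < 2; ++j)
            a[i + j] = a[i + j - 2] / (2.0 * a[i + j - 2] + 1.0);
    }
#endif
}
```

The outer loop is still sequential, because each pair waits on the previous one; the sketch only shows that the dependence distance of 2 leaves room for a 2-wide vector, not that it is faster.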
