Execute loop in routine sequentially

Hi,

in my OpenACC-accelerated program, I would like to have function calls that each thread executes sequentially. I found that function calls are now possible with OpenACC 2.0, which is supported by my compiler (pgcc from the 2014 PGI compilers). “#pragma acc routine seq” seemed to be the right solution, but with a loop inside the function, it does not work as I expected.

Here is an example code:

#include <stdio.h>   // printf
#include <stdlib.h>  // malloc
#include <string.h>  // memset

#pragma acc routine seq
void fill(int *buf) {
  for(int j=0; j<100; ++j) {
    buf[j] = j+5;
  }
}

int main() {
  int *buf = (int*) malloc(100*sizeof(int));
  memset(buf, 0, 100*sizeof(int));

  #pragma acc parallel loop copy(buf[0:100])
  for(int i=0; i<1; ++i) { // try it with 1 thread first
    fill(buf);
  }

  for(int z=0; z<100; ++z) {
    printf("%i ", buf[z]);
  }
}

The buffer is initialized with zeros. As output, I get:

5,0,0,0,0,…

So it seems as if the thread only writes to the first position and does not actually execute the for loop in the fill function more than once. If I increase the number of threads, the result is the same (apparently all threads only write to the first position). Replacing “fill(buf)” with “fill(buf+i)” gives a result that supports this assumption:

5,5,5,0,0,0

What am I doing wrong here? Obviously, I would like to have 5,6,7,8,… as a result when using one thread. Even if the “seq” is probably wrong, how can I do it right?

Thanks in advance!
Marius

P.S. I know the example fill function could easily be parallelized too, but I need it to be sequential because in the real code the loop iterations depend on each other.

Hi Marius,

This looks to me like a compiler error when generating the loop in the “fill” routine. If I change the loop from a constant trip count to a variable one, the code works as expected. I’ve added a problem report (TPR#20523) and sent it on to engineering for further investigation.

Here’s the workaround:

% cat buf1.c
#include <stdlib.h>

#pragma acc routine seq
 void fill(int *buf, int size) {
   for(int j=0; j<size; ++j) {
     buf[j] = j+5;
   }
 }

 int main() {
   int *buf = (int*) malloc(100*sizeof(int));
   memset(buf, 0, 100*sizeof(int));

   #pragma acc parallel loop copy(buf[0:100])
   for(int i=0; i<1; ++i) { // try it with 1 thread first
     fill(buf,100);
   }

   for(int z=0; z<100; ++z) {
     printf("%i ", buf[z]);
   }
   printf("\n");
 }
% pgcc -Minfo=accel buf1.c -acc; a.out
fill:
      4, Generating acc routine seq
         Generating Tesla code
main:
     14, Generating copy(buf[:100])
         Accelerator kernel generated
         15, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
     14, Generating Tesla code
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104

Hope this helps,
Mat

Thanks a lot, Mat, that works for me, even though it would be nicer without the workaround. The problem reports are not publicly available, are they? I’d like to keep track of whether/when this gets fixed!

Marius

Hi Marius,

The fix for TPR#20523 will be in the 14.7 release.

“The problem reports are not publicly available, are they?”

No, sorry. There are a number of technical issues which prevented us from doing this. Actually, the reason I started the PGI User Forums back in 2004 was because we couldn’t open our TPR system. If you want status on a specific TPR, please don’t hesitate to post or send a note to PGI Customer Service (trs@pgroup.com).

- Mat

TPR#20523 (OpenACC: User code gets wrong answers when using a loop with fixed iteration count in routine seq) is fixed in the 14.7 release.


thanks,
dave