Strange Failure to fill values?

I’m getting a very strange result from the following function that I can’t figure out. I’m sorry that I haven’t been able to reduce the function to something that anyone but a select few can run, but I can’t seem to replicate the problem with anything but this code.

static void fn_1_1ii(A*z,A*l,A*r,A*penv[]){
A env0[1];A*env[]={env0,penv[0]};
DO(i,1)env0[i].v=NULL;
{B zc=1,rc=1,lc=1;
A *rslt=&env[0][0];A *rgt=r;A *lft=l;
#pragma acc enter data copyin(zc)
I rr=rgt->r;I lr=lft->r;
B*restrict rs=rgt->s;B*restrict ls=lft->s;
aplint32 *restrict rv=(rgt)->v;aplint32 *restrict lv=(lft)->v;
I zr;B zs[15];
if(rr!=0&&lr!=0&&abs(rr-lr)>1)error(4);int minr=rr>lr?lr:rr;
if(lr==rr&&rr>0){I n=rr-1;DO(i,n)if(rs[i+1]!=ls[i+1])error(5);}
else if(lr<rr){DO(i,lr)if(ls[i]!=rs[i+1])error(5);}
else{DO(i,rr)if(ls[i+1]!=rs[i])error(5);}
zs[0]=1;if(lr>rr){zr=lr;DO(i,lr)zs[i]=ls[i];}
else{zr=rr;DO(i,rr)zs[i]=rs[i];}
zr=zr==0?1:zr;zs[0]+=minr==zr?ls[0]:1;
ai(rslt,zr,zs,1);
aplint32 *restrict zv=(rslt)->v;
DO(i,zr)zc*=zs[i];DO(i,lr)lc*=ls[i];DO(i,rr)rc*=rs[i];
I zcp=zc;I rcp=rc;I lcp=lc;
I lt=lr!=0;I rt=rr!=0;zc/=zc==0?1:zs[0];rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#pragma acc kernels loop present(zv[:zcp],lv[:lcp])
DO(i,lc){I lvi=lt*i;zv[i]=lv[lvi];}
#pragma acc kernels loop independent present(zv[:zcp],rv[:rcp])
DO(i,rc){I zvi=lc+i;I rvi=rt*i;zv[zvi]=rv[rvi];}
#pragma acc exit data delete(zc)
}
cpaa(z,&env[0][0]);
fe(&env0[1],0);}

Basically, this code copies two arrays into a third, and the main concern is the second kernels loop. The first kernels loop works every time. However, in some cases, particularly when rc > lc, the second loop doesn’t seem to iterate over the whole space and copies only part of the data.

I haven’t been able to figure out why this happens. It happens with small and larger inputs. But with some of the inputs, I’ve been able to run the test manually with DO(i,N) in place of DO(i,rc), where N is the literal number of iterations I want, and it works, for values as small as 12 and as large as 100.

I have confirmed that before entry into the second loop that the rc value is right. The DO macro above is just:

#define DO(i,n) for(long long int i = 0; i < (n); i++)

I can also confirm that the code works without OpenACC and on other compilers, so the algorithm is correct in so far as it goes.
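For reference, here is a minimal host-only sketch of what the two copy loops are meant to compute, with the shape checks and allocation left out. The function name catenate_ref and its parameter list are made up for illustration; lt and rt play the same role as in the original (0 for a scalar argument, 1 for an array).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-only sketch of the two copy loops. lt/rt are 0 when the
   corresponding argument is a scalar, so element 0 is reused for every
   slot (scalar extension), and 1 when it is an array. */
static void catenate_ref(int32_t *zv,
                         const int32_t *lv, size_t lc, int lt,
                         const int32_t *rv, size_t rc, int rt)
{
    assert(zv != NULL && lv != NULL && rv != NULL);
    assert((lt == 0 || lt == 1) && (rt == 0 || rt == 1));
    for (size_t i = 0; i < lc; i++)
        zv[i] = lv[(size_t)lt * i];        /* left part fills zv[0..lc) */
    for (size_t i = 0; i < rc; i++)
        zv[lc + i] = rv[(size_t)rt * i];   /* right part fills zv[lc..lc+rc) */
}
```

With a scalar 5 on the left (lc=1, lt=0) and the vector 0 1 2 3 4 on the right (rc=5, rt=1), the result buffer should hold 5 0 1 2 3 4.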

Here’s an example test failure with the expected correct values and the incorrect values:

Expected
    
    ¯29 ¯28 ¯27 ¯26 ¯25 ¯24 ¯23 ¯22 ¯21 ¯20
    ¯19 ¯18 ¯17 ¯16 ¯15 ¯14 ¯13 ¯12 ¯11 ¯10
     ¯9  ¯8  ¯7  ¯6  ¯5  ¯4  ¯3  ¯2  ¯1   0
     99  98  97  96  95  94  93  92  91  90
     89  88  87  86  85  84  83  82  81  80
     79  78  77  76  75  74  73  72  71  70
     69  68  67  66  65  64  63  62  61  60
     59  58  57  56  55  54  53  52  51  50
     49  48  47  46  45  44  43  42  41  40
     39  38  37  36  35  34  33  32  31  30
     29  28  27  26  25  24  23  22  21  20
     19  18  17  16  15  14  13  12  11  10
      9   8   7   6   5   4   3   2   1   0
    ~
Got
    
    ¯29 ¯28 ¯27 ¯26 ¯25 ¯24 ¯23 ¯22 ¯21 ¯20
    ¯19 ¯18 ¯17 ¯16 ¯15 ¯14 ¯13 ¯12 ¯11 ¯10
     ¯9  ¯8  ¯7  ¯6  ¯5  ¯4  ¯3  ¯2  ¯1   0
     99  98  97  96  95  94  93  92  91  90
     89  88  87  86  85  84  83  82  81  80
     79  78  77  76  75  74  73  72  71  70
     39  38  37  36  35  34  33  32  31  30
     29  28  27  26  25  24  23  22  21  20
     19  18  17  16  15  14  13  12  11  10
      9   8   7   6   5   4   3   2   1   0
    ¯29 ¯28 ¯27 ¯26 ¯25 ¯24 ¯23 ¯22 ¯21 ¯20
    ¯19 ¯18 ¯17 ¯16 ¯15 ¯14 ¯13 ¯12 ¯11 ¯10
     ¯9  ¯8  ¯7  ¯6  ¯5  ¯4  ¯3  ¯2  ¯1   0
    ~

I can see here where it almost seems like the code is overwriting the same region of memory multiple times, though I can’t account for the extra values at the bottom.

Hi Aaron,

#pragma acc enter data copyin(zc)

Why are you explicitly copying “zc” to the device?

It should be extraneous since even though you copy it over to the device, you don’t actually use it. But if you’re missing some code in the snippet and were actually using it on the device without adding update directives, this would be a problem.

For example, if the “DO(i,zr)zc*=zs[i];” loop was computed on the device, then when the zc value is used on the host, “zc/=zc==0?1:zs[0];lc=lr==0?zc:lc”, you’d be using the original host value, not the updated value.

Also, if rslt, zr, or zs are updated on the device in the call to “ai” and not synchronized with host, again you’d be using different values.

My best guess is that the “zvi” and/or “rvi” indices are not being computed correctly because of a device/host synchronization issue with one of your variables. You can test this theory by turning off all OpenACC directives except these two kernel loops and changing “present” to “copy”.
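A rough sketch of that experiment, assuming both arguments are plain arrays (no scalar extension) and that no enter data region is active for them. The function catenate_copy and its parameters are hypothetical and not part of the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-alone version of the two loops: no enter/exit data
   regions, only copy clauses on each compute construct, so every array
   is transferred to the device before the loop and back afterwards. */
static void catenate_copy(int32_t *restrict zv, long long zcp,
                          int32_t *restrict lv, long long lc,
                          int32_t *restrict rv, long long rc)
{
    assert(lc >= 0 && rc >= 0 && zcp == lc + rc);
#pragma acc kernels loop independent copy(zv[:zcp], lv[:lc])
    for (long long i = 0; i < lc; i++)
        zv[i] = lv[i];
#pragma acc kernels loop independent copy(zv[:zcp], rv[:rc])
    for (long long i = 0; i < rc; i++)
        zv[lc + i] = rv[i];
}
```

One caveat: if the arrays are still present on the device from an earlier enter data directive, a copy clause may find them already present and skip the transfer, so the host and device copies would need to be synchronized with update directives around the loops.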

Hope this helps,
Mat

Thanks for the thoughts. I finally got around to eliminating that spurious zc copy, and, as expected, that doesn’t fix anything. The zc is only used on the CPU as you can see, since I haven’t elided any of the code in between in this function. However, it was a good motivation to eliminate that ugly piece of code.

The reason this code is so hard to debug is that it is auto-generated as part of a shared object that relies on third-party software in order to function properly, so it’s very hard to isolate it out of that environment.

Here’s the ai() function:

V aa(A*a,I tp){frea(a);B c=1;DO(i,a->r)c*=a->s[i];B z=0;
 B pc=8*ceil(c/8.0);
 switch(tp){
  case 1:z=sizeof(I)*pc;break;
  case 2:z=sizeof(D)*pc;break;
  case 3:z=ceil((sizeof(U8)*pc)/8.0);break;
  default: error(16);}
 z=4*ceil(z/4.0);char*v=malloc(z);if(NULL==v)error(1);
 #ifdef _OPENACC
  #pragma acc enter data create(v[:z])
 #endif
 a->v=v;a->z=z;a->c=c;a->f=2;}
V ai(A*a,I r,B *s,I tp){a->r=r;DO(i,r)a->s[i]=s[i];aa(a,tp);}

It’s a fairly straightforward allocation and simply makes sure that the array allocated is also allocated on the GPU.
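The size arithmetic in aa() can be sketched with integer rounding instead of ceil(). The helper names round_up and alloc_bytes are made up for this illustration, and returning 0 for an unknown type tag stands in for the original error(16) call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of m (m > 0). */
static size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return (n + m - 1) / m * m;
}

/* Byte size for `count` elements of type tag tp, following aa():
   1 = 32-bit integer, 2 = double, 3 = one bit per element.
   Returns 0 for an unknown tag (the original calls error(16)). */
static size_t alloc_bytes(size_t count, int tp)
{
    size_t pc = round_up(count, 8);   /* pad element count to a multiple of 8 */
    size_t z;
    switch (tp) {
    case 1: z = sizeof(int32_t) * pc; break;
    case 2: z = sizeof(double) * pc;  break;
    case 3: z = pc / 8;               break;  /* 8 one-bit elements per byte */
    default: return 0;
    }
    return round_up(z, 4);            /* pad byte size to a multiple of 4 */
}
```

So a 6-element integer result is padded to 8 elements and gets a 32-byte buffer on both the host and the device.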

I was also thinking that zvi and rvi must be miscalculated somehow, but here are a few examples of codes and outputs that make me think otherwise:

static void fn_1_1ii(A*z,A*l,A*r,A*penv[]){
A env0[1];A*env[]={env0,penv[0]};
DO(i,1)env0[i].v=NULL;
{B zc=1,rc=1,lc=1;
A *rslt=&env[0][0];A *rgt=r;A *lft=l;
I rr=rgt->r;I lr=lft->r;
B*restrict rs=rgt->s;B*restrict ls=lft->s;
aplint32 *restrict rv=(rgt)->v;aplint32 *restrict lv=(lft)->v;
I zr;B zs[15];
if(rr!=0&&lr!=0&&abs(rr-lr)>1)error(4);int minr=rr>lr?lr:rr;
if(lr==rr&&rr>0){I n=rr-1;DO(i,n)if(rs[i+1]!=ls[i+1])error(5);}
else if(lr<rr){DO(i,lr)if(ls[i]!=rs[i+1])error(5);}
else{DO(i,rr)if(ls[i+1]!=rs[i])error(5);}
zs[0]=1;if(lr>rr){zr=lr;DO(i,lr)zs[i]=ls[i];}
else{zr=rr;DO(i,rr)zs[i]=rs[i];}
zr=zr==0?1:zr;zs[0]+=minr==zr?ls[0]:1;
ai(rslt,zr,zs,1);
aplint32 *restrict zv=(rslt)->v;
DO(i,zr)zc*=zs[i];DO(i,lr)lc*=ls[i];DO(i,rr)rc*=rs[i];
I zcp=zc;I rcp=rc;I lcp=lc;
I lt=lr!=0;I rt=rr!=0;zc/=zc==0?1:zs[0];rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#pragma acc kernels loop present(zv[:zcp],lv[:lcp])
DO(i,lc){I lvi=lt*i;zv[i]=lv[lvi];}
#pragma acc kernels loop independent present(zv[:zcp],rv[:rcp])
DO(i,100){I zvi=lc+i;I rvi=rt*i;zv[i]=i;}
}
cpaa(z,&env[0][0]);
fe(&env0[1],0);}

Notice here that I specifically avoid using zvi or rvi, and I also use 100 instead of rc as the loop bounds, and this is the result I get:

    
     0   1  2  3  4   5  6   7  8   9
    10  11 12 13 14  15 16  17 18  19
    20  21 22 23 24  25 26  27 28  29
    30  31 32 33 34  35 36  37 38  39
    40  41 42 43 44  45 46  47 48  49
    50  51 52 53 54  55 56  57 58  59
    60  61 62 63 64  65 66  67 68  69
    70  71 72 73 74  75 76  77 78  79
    80  81 82 83 84  85 86  87 88  89
    90  91 92 93 94  95 96  97 98  99
     0 585 ¯1  0  0   0  0   0  0 586
    ¯1   0  0  0  0   0  0 587 ¯1   0
     0   0  0  0  0 588 ¯1   0  0   0
    ~

In fact, this is exactly what I expect to get in this case. But now, let’s use rc instead of using 100, but still avoid using zvi or rvi:

static void fn_1_1ii(A*z,A*l,A*r,A*penv[]){
A env0[1];A*env[]={env0,penv[0]};
DO(i,1)env0[i].v=NULL;
{B zc=1,rc=1,lc=1;
A *rslt=&env[0][0];A *rgt=r;A *lft=l;
I rr=rgt->r;I lr=lft->r;
B*restrict rs=rgt->s;B*restrict ls=lft->s;
aplint32 *restrict rv=(rgt)->v;aplint32 *restrict lv=(lft)->v;
I zr;B zs[15];
if(rr!=0&&lr!=0&&abs(rr-lr)>1)error(4);int minr=rr>lr?lr:rr;
if(lr==rr&&rr>0){I n=rr-1;DO(i,n)if(rs[i+1]!=ls[i+1])error(5);}
else if(lr<rr){DO(i,lr)if(ls[i]!=rs[i+1])error(5);}
else{DO(i,rr)if(ls[i+1]!=rs[i])error(5);}
zs[0]=1;if(lr>rr){zr=lr;DO(i,lr)zs[i]=ls[i];}
else{zr=rr;DO(i,rr)zs[i]=rs[i];}
zr=zr==0?1:zr;zs[0]+=minr==zr?ls[0]:1;
ai(rslt,zr,zs,1);
aplint32 *restrict zv=(rslt)->v;
DO(i,zr)zc*=zs[i];DO(i,lr)lc*=ls[i];DO(i,rr)rc*=rs[i];
I zcp=zc;I rcp=rc;I lcp=lc;
I lt=lr!=0;I rt=rr!=0;zc/=zc==0?1:zs[0];rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#pragma acc kernels loop present(zv[:zcp],lv[:lcp])
DO(i,lc){I lvi=lt*i;zv[i]=lv[lvi];}
#pragma acc kernels loop independent present(zv[:zcp],rv[:rcp])
DO(i,rc){I zvi=lc+i;I rvi=rt*i;zv[i]=i;}
}
cpaa(z,&env[0][0]);
fe(&env0[1],0);}

And the results:

    
              0           1           2           3           4           5           6           7           8           9
             10          11          12          13          14          15          16          17          18          19
             20          21          22          23          24          25          26          27          28          29
           4096           4           0           0           0 ¯2147483648           2           0           0           0
           4096           0           0           0         128         512    33554432           0           0           0
              0           0   268435970           0        8192           0 ¯2147483648          64           0           0
         262160           0           4       40960 ¯1842182637   109229574  ¯725417836  1280774144   654939278 ¯1802203980
     ¯828599784 ¯2042358018 ¯1509113240 ¯1940649856   900891718 ¯1501260518  2122317854 ¯1936320362  1658195078  1554025536
    ¯1827367295  ¯496615147  1117130880 ¯1836707308   171098672 ¯2100157438    76284120  ¯690872176 ¯2105311088   504763946
       43253919   310542996 ¯2146397028 ¯1845324535 ¯1158540796 ¯1871507356 ¯1634434538  ¯795605500  1079871234    10621954
       67245062 ¯1809809354   129118303   831689728   525107892   412891584 ¯1802349405   395326976 ¯2105126256   403047600
     ¯805006714 ¯1778301664     1475278    51643012 ¯1852682106 ¯1013807488 ¯1744371518  1410334852   153518858   817108615
      143006712      190725 ¯2138943280   126919186 ¯1254951292 ¯1937334510 ¯1106865636  1141342228           1           0
    ~

And I tried to see what would happen if I copy instead of using present:

static void fn_1_1ii(A*z,A*l,A*r,A*penv[]){
A env0[1];A*env[]={env0,penv[0]};
DO(i,1)env0[i].v=NULL;
{B zc=1,rc=1,lc=1;
A *rslt=&env[0][0];A *rgt=r;A *lft=l;
I rr=rgt->r;I lr=lft->r;
B*restrict rs=rgt->s;B*restrict ls=lft->s;
aplint32 *restrict rv=(rgt)->v;aplint32 *restrict lv=(lft)->v;
I zr;B zs[15];
if(rr!=0&&lr!=0&&abs(rr-lr)>1)error(4);int minr=rr>lr?lr:rr;
if(lr==rr&&rr>0){I n=rr-1;DO(i,n)if(rs[i+1]!=ls[i+1])error(5);}
else if(lr<rr){DO(i,lr)if(ls[i]!=rs[i+1])error(5);}
else{DO(i,rr)if(ls[i+1]!=rs[i])error(5);}
zs[0]=1;if(lr>rr){zr=lr;DO(i,lr)zs[i]=ls[i];}
else{zr=rr;DO(i,rr)zs[i]=rs[i];}
zr=zr==0?1:zr;zs[0]+=minr==zr?ls[0]:1;
ai(rslt,zr,zs,1);
aplint32 *restrict zv=(rslt)->v;
DO(i,zr)zc*=zs[i];DO(i,lr)lc*=ls[i];DO(i,rr)rc*=rs[i];
I zcp=zc;I rcp=rc;I lcp=lc;
I lt=lr!=0;I rt=rr!=0;zc/=zc==0?1:zs[0];rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#pragma acc update host(zv[:zcp],rv[:rcp],lv[:lcp])
#pragma acc kernels loop copy(zv[:zcp],lv[:lcp])
DO(i,lc){I lvi=lt*i;zv[i]=lv[lvi];}
#pragma acc kernels loop independent copy(zv[:zcp],rv[:rcp])
DO(i,rc){I zvi=lc+i;I rvi=rt*i;zv[i]=i;}
#pragma acc update device(zv[:zcp])

}
cpaa(z,&env[0][0]);
fe(&env0[1],0);}

Notice in the above that I’m still avoiding the use of zvi and rvi, and I’m just using zv[i]=i; Here’s what I get:

    
              0           1           2           3           4           5           6           7           8           9
             10          11          12          13          14          15          16          17          18          19
             20          21          22          23          24          25          26          27          28          29
           4096           4           0           0           0 ¯2147483648           2           0           0           0
           4096           0           0           0         128         512    33554432           0           0           0
              0           0   268435970           0        8192           0 ¯2147483648          64           0           0
         262160           0           4       40960 ¯1842182637   109229574  ¯725417836  1280774144   654939278 ¯1802203980
     ¯828599784 ¯2042358018 ¯1509113240 ¯1940649856   900891718 ¯1501260518  2122317854 ¯1936320362  1658195078  1554025536
    ¯1827367295  ¯496615147  1117130880 ¯1836707308   171098672 ¯2100157438    76284120  ¯690872176 ¯2105311088   504763946
       43253919   310542996 ¯2146397028 ¯1845324535 ¯1158540796 ¯1871507356 ¯1634434538  ¯795605500  1079871234    10621954
       67245062 ¯1809809354   129118303   831689728   525107892   412891584 ¯1802349405   395326976 ¯2105126256   403047600
     ¯805006714 ¯1778301664     1475278    51643012 ¯1852682106 ¯1013807488 ¯1744371518  1410334852   153518858   817108615
      143006712      190725 ¯2138943280   126919186 ¯1254951292 ¯1937334510 ¯1106865636  1141342228           1           0
    ~

I’m not sure why the iteration appears to consistently stop at 30 here (which happens to be the same size as lc). There seems to be a fairly consistent pattern that the second loop iterates only as many times as the first loop. That is, if the first loop iterates 6 times and the second loop should iterate 12 times, then the second loop only fills in 6 values, despite rc == 12. The same thing happens in each case: the second loop fills only lc slots instead of rc slots.

That’s actually a very consistent pattern now that I look at it, and seems to be at the heart of the issue.
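One way to measure how many slots a loop really wrote is to pre-fill the result with a sentinel and count the prefix that changed. This is only a sketch: the helpers fill_sentinel and written_prefix are hypothetical, and it assumes INT32_MIN never occurs as a legitimate value in the test data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL INT32_MIN  /* assumed never to be a legitimate value */

/* Mark every slot as "not written yet". */
static void fill_sentinel(int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        v[i] = SENTINEL;
}

/* Number of leading slots that no longer hold the sentinel. */
static size_t written_prefix(const int32_t *v, size_t n)
{
    size_t k = 0;
    while (k < n && v[k] != SENTINEL)
        k++;
    return k;
}
```

The idea would be to fill zv on the host, push it with an update device directive, run the zv[i]=i loop, pull it back with update host, and compare written_prefix(zv, rc) against rc.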

I have also tried to reorder the loops, but the result has been the same.

Here are the gpu kernels for the second loop that are being generated:

extern "C" __global__ __launch_bounds__(128) void
fn_1_1ii_179_gpu(
    signed char* p2/* zv */)
{
long long _k8_1;
long long i18s;
long long i19i;
i18s = 0LL;
_BB_8: ;
i19i = ((long long)((((int)blockIdx.x)*(128))+((int)threadIdx.x)))+(i18s);
if( ((i18s)>=(100LL)))  goto _BB_9;
if( ((i19i)>=(100LL)))  goto _BB_9;
_k8_1 = (i19i)*(4LL);
(( int*)p2/* zv */)[i19i] = (int)(i19i);
_BB_9: ;
i18s = ((long long)(((int)gridDim.x)*(128)))+(i18s);
if( ((i18s)<(100LL)))  goto _BB_8;
}



extern "C" __global__ __launch_bounds__(128) void
fn_1_1ii_179_gpu(
    long long tc1,
    signed char* p3/* zv */)
{
long long _k8_1;
long long i19s;
long long i20i;
i19s = 0LL;
_BB_8: ;
i20i = ((long long)((((int)blockIdx.x)*(128))+((int)threadIdx.x)))+(i19s);
if( ((i19s)>=(tc1)))  goto _BB_9;
if( ((i20i)>=(tc1)))  goto _BB_9;
_k8_1 = (i20i)*(4LL);
(( int*)p3/* zv */)[i20i] = (int)(i20i);
_BB_9: ;
i19s = ((long long)(((int)gridDim.x)*(128)))+(i19s);
if( ((i19s)<(tc1)))  goto _BB_8;
}

The first is the fixed 100 count loop, and the second is the generic one. The first works, the second doesn’t.

Hi Aaron,

Ok, so the only difference between these two generated kernels is that the loop bound is passed in as an argument in the “rc” case. This means the problem must be on the host side, with either the value of “rc” being wrong or the launch configuration being computed incorrectly.

What data type is “rc”? Is “B” an int or long long? If it’s long long, what happens if you change it to an int?

What are the compiler feedback messages for this loop (-Minfo=accel) for both the “rc” and “100” cases?

What does the profile information say about each of these two loops? In particular, I’m looking for the launch configuration (grid/block size). To gather the profile information, set PGI_ACC_TIME=1 in your environment.

What happens if you explicitly set “rc=100” instead of “rc=rr==0?zc:rc”?

  • Mat

I’ve finally managed to get a version that should compile without requiring any additional third-party software, so you can try this out on your own machine.

#include <math.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <accelmath.h>
extern unsigned int __popcnt (unsigned int);
#endif
int isinit=0;
#define PI 3.14159265358979323846
typedef uint64_t BOUND;
typedef int32_t aplint32;
typedef BOUND B;typedef long long int L;typedef aplint32 I;typedef double D;typedef void V;
typedef unsigned char U8;
struct array {I r; B s[15];I f;B c;B z;V*v;};
typedef struct array A;
#define DO(i,n) for(L i=0;i<(n);i++)
#define R return
V frea(A*a){if (a->v!=NULL){char*v=a->v;B z=a->z;
 if(a->f){
#ifdef _OPENACC
#pragma acc exit data delete(v[:z])
#endif
}
 if(a->f>1){free(v);}}}
V aa(A*a,I tp){frea(a);B c=1;DO(i,a->r)c*=a->s[i];B z=0;
 B pc=8*ceil(c/8.0);
 switch(tp){
  case 1:z=sizeof(I)*pc;break;
  case 2:z=sizeof(D)*pc;break;
  case 3:z=ceil((sizeof(U8)*pc)/8.0);break;
  default: error(16);}
 z=4*ceil(z/4.0);char*v=malloc(z);if(NULL==v)error(1);
 #ifdef _OPENACC
  #pragma acc enter data create(v[:z])
 #endif
 a->v=v;a->z=z;a->c=c;a->f=2;}
V ai(A*a,I r,B *s,I tp){a->r=r;DO(i,r)a->s[i]=s[i];aa(a,tp);}
V fe(A*e,I c){DO(i,c){frea(&e[i]);}}
V cpaa(A*t,A*s){frea(t);memcpy(t,s,sizeof(A));}
A*tenv=NULL;
A*env[]={NULL};

static void Init(A*z,A*l,A*r,A*penv[]){
A*env[]={tenv};
DO(i,0)tenv[i].v=NULL;
}

static void fn_1_1ii(A*z,A*l,A*r,A*penv[]){
A env0[1];A*env[]={env0,penv[0]};
DO(i,1)env0[i].v=NULL;
{B zc=1,rc=1,lc=1;
A *rslt=&env[0][0];A *rgt=r;A *lft=l;
I rr=rgt->r;I lr=lft->r;
B*restrict rs=rgt->s;B*restrict ls=lft->s;
aplint32 *restrict rv=(rgt)->v;aplint32 *restrict lv=(lft)->v;
I zr;B zs[15];
if(rr!=0&&lr!=0&&abs(rr-lr)>1)error(4);int minr=rr>lr?lr:rr;
if(lr==rr&&rr>0){I n=rr-1;DO(i,n)if(rs[i+1]!=ls[i+1])error(5);}
else if(lr<rr){DO(i,lr)if(ls[i]!=rs[i+1])error(5);}
else{DO(i,rr)if(ls[i+1]!=rs[i])error(5);}
zs[0]=1;if(lr>rr){zr=lr;DO(i,lr)zs[i]=ls[i];}
else{zr=rr;DO(i,rr)zs[i]=rs[i];}
zr=zr==0?1:zr;zs[0]+=minr==zr?ls[0]:1;
ai(rslt,zr,zs,1);
aplint32 *restrict zv=(rslt)->v;
DO(i,zr)zc*=zs[i];DO(i,lr)lc*=ls[i];DO(i,rr)rc*=rs[i];
I zcp=zc;I rcp=rc;I lcp=lc;
I lt=lr!=0;I rt=rr!=0;zc/=zc==0?1:zs[0];rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#pragma acc kernels loop present(zv[:zcp],lv[:lcp])
DO(i,lc){I lvi=lt*i;zv[i]=lv[lvi];}
#pragma acc kernels loop independent present(zv[:zcp],rv[:rcp])
DO(i,rc){I zvi=lc+i;I rvi=rt*i;zv[zvi]=rv[rvi];}
}
cpaa(z,&env[0][0]);
fe(&env0[1],0);}

int main(int argc, char *argv[]){
 A lft,rgt;I lr,rr;B ls[15];B rs[15];A rslt;rslt.v=NULL;
 lr=0;rr=1;rs[0]=5;ai(&lft,lr,ls,1);ai(&rgt,rr,rs,1);
 I *restrict lv;I *restrict rv;lv=lft.v;rv=rgt.v;
 lv[0]=5;DO(i,5){rv[i]=i;}
#pragma acc update device(lv[:1],rv[:5])
 fn_1_1ii(&rslt,&lft,&rgt,env);
 I *restrict zv=rslt.v;
#pragma acc update host(zv[:6])
 DO(i,6)printf("%d ",zv[i]);printf("\n");
 return 0;}

I’ve also discovered that when working with this example test file, using the -fast option causes the bug, while compiling without the -fast option seems to work. The same behaviors as I mentioned before are demonstrated including the rc=5 working.

I used ‘pgcc -fast -acc -ta=tesla:nollvm,nordc,cuda7.5 -o test test.c’ to compile the above, and without the -fast option it works. I also used 5 instead of rc in the DO(i,rc) loop and it worked. The output should be 5 0 1 2 3 4.

Thanks Aaron, the example really helps!

The issue appears to be that the compiler is passing in “1” instead of “5” for the loop bounds variable “rc”. Somehow setting “rc” using the conditional expression is causing the wrong value to be used.

I added TPR#22339 and sent it to our engineers for further evaluation. The work around is to change the conditional expression to an if statement:

#ifndef WORKS
 rc=rr==0?zc:rc;lc=lr==0?zc:lc;
#else
 if (rr==0) { rc=zc; };
 if (lr==0) { lc=zc; };
#endif



% pgcc -fast test_030716a.c -acc -ta=tesla:cc35 ; a.out
5 0 0 0 0 0
% pgcc -fast test_030716a.c -acc -ta=tesla:cc35 -DWORKS ; a.out
5 0 1 2 3 4

Note that I fixed a problem with the code where the lft.v and rgt.v arrays weren’t initialized so contained garbage. When passed to the “frea” routine, the code segv’d when trying to free the array.

Adding " lft.v=rgt.v=NULL;" in main fixes this issue.
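A small sketch of why the initialization matters, using a simplified stand-in for the array struct. The names desc, release, and attach are hypothetical, and unlike frea() this release() also resets the pointer after freeing.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for struct array: only the fields that the
   release path looks at. */
struct desc { int f; size_t z; void *v; };

/* Free the buffer only if one was attached, mirroring the NULL check
   in frea(); then reset the descriptor. */
static void release(struct desc *d)
{
    assert(d != NULL);
    if (d->v != NULL && d->f > 1)
        free(d->v);
    d->v = NULL;
    d->z = 0;
    d->f = 0;
}

/* Attach a fresh buffer. This calls release() first, so d->v must be
   either NULL or a valid pointer on entry; an uninitialized descriptor
   would hand garbage to free(). */
static void attach(struct desc *d, size_t bytes)
{
    release(d);
    d->v = malloc(bytes);
    if (d->v != NULL) {
        d->z = bytes;
        d->f = 2;
    }
}
```

Declaring the descriptor as struct desc d = {0}; before the first attach() gives the NULL check something defined to test, which is the same effect as setting lft.v and rgt.v to NULL in main.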

  • Mat

Thanks for the help on this! I’ll see if I can make the changes here and make sure that those fixes work on my end.

As for the bug, that’s an error in this example that I overlooked because the main() function isn’t actually part of the code that is generated for this, and it was something I tried to come up with on the fly to isolate the code from the rest of the program. Thanks for the fix!

Just as another note: I’m seeing this bug crop up pervasively throughout my code, which makes heavy use of the ?: operator.

Issue should be resolved with PGI 20.3