# Parallelize some loops with OMP + ACC

Hello,
My program has a few for loops which can be parallelized.
As an approximation, this is how the program works and the bounds of the for loops.

``````void function(int m)
{
    CT
}

int main(void)
{
    for (m = 1; m < 5; m++)
    {
        function(m);
        for (p = 1; p < 8; p++)
            for (lu = 1; lu < 251; lu++)
            {
                for (s = 0; s < 3; s++)
                    for (a = 0; a < 3; a++)
                    {
                        for (i = 1; i < 365; i++)
                            for (v = 1; v < 10; v++)
                                use CT
                        for (j = 1; j < 40; j++)
                            if minim
                                RES[m][p][0] = ...
                    }
                for (t = 0; t < 3; t++)
                    for (c = 0; c < 85; c++)
                    {
                        for (i = 1; i < 365; i++)
                            for (v = 1; v < 10; v++)
                                use CT
                        for (j = 1; j < 40; j++)
                            if minim
                                RES[m][p][1] = ...
                    }
            }
    }
}
``````

I have a combined question, software + hardware.
As you can see, the final array of results depends only on m and p from the first and second for loops; all other loops must be executed completely to write a value into RES.
I have an AMD CPU with 8 physical cores and a GTX 690 with 2 x 1500 CUDA cores.
I need your help parallelizing this program with OMP + ACC so it uses the hardware resources well.
My problems are:

• I don’t know what to write in each #pragma so that RES depends only on m and p and everything else is executed completely
• maybe I can add an OMP nowait to the “lu” loop and remove OMP from the “p” loop; this should avoid writing the dependency on “p” and an OMP atomic at RES
• where should I activate the second GPU? The loops are a bit unbalanced, because “c” goes up to 85 while “a” only goes to 3

Below is my attempt as a beginner; any suggestion is welcome.

``````void function(int m)
{
    CT
}

int main(void)
{
    for (m = 1; m < 5; m++)
    {
        function(m);
        #pragma acc data copyin(CT) copyout(RES)
        {
            #pragma omp parallel for schedule(dynamic) shared(??) default(none)
            for (p = 1; p < 8; p++)
                for (lu = 1; lu < 251; lu++)
                    #pragma acc region
                    {
                        #pragma acc loop independent vector(16)
                        for (s = 0; s < 3; s++)
                        #pragma acc loop independent vector(16)
                        for (a = 0; a < 3; a++)
                        {
                            #pragma acc loop independent vector(16)
                            for (i = 1; i < 365; i++)
                                for (v = 1; v < 10; v++)
                                    use CT
                            for (j = 1; j < 40; j++)
                                #pragma acc atomic
                                if minim
                                    RES[m][p][0] = ...
                        }
                        #pragma acc loop independent vector(16)
                        for (t = 0; t < 3; t++)
                        #pragma acc loop independent vector(16)
                        for (c = 0; c < 85; c++)
                        {
                            #pragma acc loop independent vector(16)
                            for (i = 1; i < 365; i++)
                                for (v = 1; v < 10; v++)
                                    use CT
                            for (j = 1; j < 40; j++)
                                #pragma acc atomic
                                if minim
                                    RES[m][p][1] = ...
                        }
                    }
        }
    }
}
``````

Hi tasica,

There’s no need for OpenMP in this case since the code doesn’t do any computation on the host and you only have one GPU. I’d suggest doing something like the following. Note that “region” is from the old PGI Accelerator model and has been deprecated; you should use OpenACC’s “parallel” or “kernels” compute constructs instead. Also, OpenACC does not allow nested “vector” loops; you should “collapse” these loops instead. Finally, I would use a min reduction for the results rather than atomics. It means adding a few scalars, but you won’t need to copy “RES” to the device.

``````void function(int m)
{
    CT
}

int main(void)
{

    ....

    // Move the device creation of CT out of the "m" loop so that it only
    // gets created once.
    #pragma acc data create(CT)
    {

    for (m = 1; m < 5; m++)
    {
        function(m);
        // Update CT's device value
        #pragma acc update device(CT)

        min0 = MAX_VAL;
        min1 = MAX_VAL;

        #pragma acc parallel loop gang collapse(2) reduction(min:min0,min1)
        for (p = 1; p < 8; p++)  {
        for (lu = 1; lu < 251; lu++)  {

            #pragma acc loop worker collapse(2) reduction(min:min0)
            for (s = 0; s < 3; s++)
            for (a = 0; a < 3; a++)
            {
                #pragma acc loop vector
                for (i = 1; i < 365; i++)
                    for (v = 1; v < 10; v++)
                        use CT
                #pragma acc loop vector reduction(min:min0)
                for (j = 1; j < 40; j++)
                    if minim
                        min0 = ....
            }
            #pragma acc loop worker collapse(2) reduction(min:min1)
            for (t = 0; t < 3; t++)
            for (c = 0; c < 85; c++)
            {
                #pragma acc loop vector
                for (i = 1; i < 365; i++)
                    for (v = 1; v < 10; v++)
                        use CT
                #pragma acc loop vector reduction(min:min1)
                for (j = 1; j < 40; j++)
                    if minim
                        min1 = ...
            }

        }  // end "lu" loop
        }  // end "p" loop

        RES[m][p][0] = min0;
        RES[m][p][1] = min1;

    }
    }
}
``````

I really appreciate your suggestions; I will try to implement them in the next 2 days.
I hope it will work smoothly.
Thank you very much :-)

PROBLEM SOLVED - COMMENTING OUT PRAGMA DIRECTIVES ONE BY ONE, I FOUND THE MISTAKE IN WRITING THE RESULTS TO RES; ALL OTHER PRAGMAS WERE CORRECT
SUBJECT CLOSED

update:
I’ve got some strange messages.
Searching similar topics, I found problems with -g or not enough space in /tmp …
This is why you find 3 different collections of options at compilation.
Where are they coming from?
I’m using a qsort function and a do-while inside the parallel region; maybe qsort is the reason.

To avoid confusion about discrepancies between the GTX 690 and the option tesla:cc50, I must mention that I work on a new laptop and run the program for results on the GTX 690.

``````1__________________
$ pgc++ 2.cpp -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mprof=ccff -Mipa=libc

(../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 8388608 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 199 10 -x 39 0x80 -x 59 4 -tp haswell -x 120 0x1000 -astype 0 -x 121 1 -fn 2.cpp -il /tmp/pgc++LsDV9T56LEk.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -autoinl 10 -x 168 400 -x 174 128000 -x 14 0x200000 -x 46 4 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -accel multicore -x 180 0x4000400 -x 121 0xc00 -x 163 0x1 -x 186 0x80000 -cudaver 7.5 -x 194 0x40000 -x 176 0x100 -cudacap 50 -x 186 0x80000 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 180 0x4000400 -x 121 0xc00 -x 163 1 -x 186 0x80000 -x 180 0x400 -x 121 0xc00 -x 210 8 -x 176 0x100 -cudacap 50 -x 189 0x8000 -y 163 0xc0000000 -x 189 0x10 -y 189 0x4000000 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -concur 512 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 2 0x100000 -x 89 0x80800842 -x 66 0x20000 -x 14 0x104 -y 89 0x40 -gnuvsn 60401 -x 69 0x200 -cmdline '+pgc++ /tmp/pgc++LsDV9T56LEk.il -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mprof=ccff -Mipa=libc' -ipacomp pgc++ -ipacompsw '+\$n -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mprof=ccff -Mipa=libc -c --ZIPAimport \$f \$b \$a -o \$o' -ipaexport /tmp/pgc++TsDhPtAKHnI.ipx -exfile /tmp/pgc++vsD-XqxbpfN.ipn -exifile /tmp/pgc++9sD3b3pEFYp.ipm -ipofile /tmp/pgc++fsDpVflfxMx.ipo -ccff /tmp/pgc++DsDxmeOO-B-.ccff -asm /tmp/pgc++nsDNR9TPzob.s

2__________________
$ pgc++ -g 2.cpp -lgsl -lgslcblas -lm -mp -acc

(../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -debug -x 120 0x200 -opt 0 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -tp haswell -astype 0 -fn 2.cpp -il /tmp/pgc++gEDsPFGvB60.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -accel host -x 186 0x80000 -x 180 0x4000400 -x 163 0x1 -cudaver 7.5 -x 176 0x100 -cudacap 30 -cudacap 35 -cudacap 50 -x 121 0xc00 -x 194 0x40000 -x 189 0x8000 -y 163 0xc0000000 -x 163 0x800000 -x 189 0x10 -y 189 0x4000000 -mp -x 69 0x200 -x 69 0x400 -gnuvsn 60401 -x 69 0x200 -x 119 0x08 -cmdline '+pgc++ /tmp/pgc++gEDsPFGvB60.il -g -lgsl -lgslcblas -lm -mp -acc' -asm /tmp/pgc++2EDIZCFS1Yx.s

3________________________
$ pgc++ -fast -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm 2.cpp

(../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 8388608 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 199 10 -x 39 0x80 -x 59 4 -tp haswell -x 120 0x1000 -astype 0 -x 121 1 -fn 2.cpp -il /tmp/pgc++zPDlBCFMVFt.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -autoinl 10 -x 168 400 -x 174 128000 -x 14 0x200000 -x 46 4 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -x 180 0x4000400 -x 121 0xc00 -x 163 0x1 -x 186 0x80000 -cudaver 7.5 -x 194 0x40000 -x 176 0x100 -cudacap 50 -x 186 0x80000 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 189 0x8000 -y 163 0xc0000000 -x 189 0x10 -y 189 0x4000000 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 89 0x80800842 -x 66 0x20000 -x 14 0x104 -y 89 0x40 -mp -x 69 0x200 -x 69 0x400 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -gnuvsn 60401 -x 69 0x200 -cmdline '+pgc++ /tmp/pgc++zPDlBCFMVFt.il -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm' -ipacomp pgc++ -ipacompsw '+\$n -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm -c --ZIPAimport \$f \$b \$a -o \$o' -ipaexport /tmp/pgc++5PDRI9samYH.ipx -exfile /tmp/pgc++bPDd4jRfmxa.ipn -exifile /tmp/pgc++jPDB44iAnaJ.ipm -ipofile /tmp/pgc++XPDtGrube_d.ipo -asm /tmp/pgc++HPDJ0Fld5vP.s
``````

Hi tasica,

If you can, please send a reproducing example of the file that caused the error to PGI Customer Service (trs@pgroup.com).

While the cause was a user error with a misspelled pragma, the compiler should have caught this or at least not fail with an internal error. I’d like to have our engineers look at this.

Thanks,
Mat

Hi Mat,

In the next few days I will make a short version of my program.
Until then: ATOMIC was the #pragma I commented out, and then the program gave me back errors in the traditional style.
The solution was to follow your suggestion with the scalar, so I avoid using ATOMIC.
In the beginning I didn’t want to use the scalar because it required writing more lines. Anyway, it looks like I must learn to use ATOMIC.
gcc was also complaining about the way I’d written the ATOMIC #pragma, something with the {} symbols.

``````double med, *****re9;
int *sir;

// if the actual value (med) is smaller than the one already stored at
// position oCr, then store the whole new environment: *sir, med, r1, r2

#pragma acc atomic
{
    if (med < re9[mm][j][p][oCr][t])
    {
        for (int c = 0; c < oCr; c++)
            re9[mm][j][p][c][t] = (double) (sir[c]);
        re9[mm][j][p][oCr][t] = med;
        re9[mm][j][p][oCr + 1][t] = (double)r1;
        re9[mm][j][p][oCr + 2][t] = (double)r2;
    }
}
``````

Hi tasica,

Atomics can only be applied to a specific storage location and are only needed when you have collisions on that location. A collision is when multiple threads access the same location and thus could produce a race condition.

So here you’re trying to make a whole region atomic instead of a specific memory location. This is more like an OpenMP critical section. Critical sections aren’t supported in OpenACC, since OpenACC doesn’t presume support for global synchronization between threads, which is required for critical sections to work.

Without the full code it’s difficult to give you recommendations here, but my assumption is that you probably don’t need atomics at all. Most likely you’re parallelizing across the first, second, and/or third dimension of “re9”, so there isn’t a collision on the fourth and fifth dimensions. Though for illustration, let’s assume there is a collision, in which case you’d do something like the following:

``````
tmp = re9[mm][j][p][oCr][t];

if (med < tmp)
{
    for (int c = 0; c < oCr; c++)
        re9[mm][j][p][c][t] = (double) (sir[c]);
    #pragma acc atomic write
    re9[mm][j][p][oCr][t] = med;
    #pragma acc atomic write
    re9[mm][j][p][oCr + 1][t] = (double)r1;
    #pragma acc atomic write
    re9[mm][j][p][oCr + 2][t] = (double)r2;
}
``````

Note that since “atomic” can slow your code down, it’s recommended to refactor algorithms to remove any collisions and only use atomics when that is not possible.

-Mat

Mat,
thank you very much for the clarification.
Now I know where the “strange” message was coming from; I used atomic incorrectly.
I hope it is no longer necessary to send the code.
I want to help improve the compiler, but as a beginner it takes me a while to write it up, and I also don’t have much free time.
gcc was showing the error; pgc was terminated by signal 6.
Let me know if you need the code and I will find time to write it.

second problem
I’ve noticed pgdbg gives a SIGSEGV signal a few lines (2, sometimes 4) before a “#pragma acc data copy” with a non-contiguous array, not exactly at the #pragma line.
the message

`````` SIGSEGV at 0x7FF9AB906D72, function __c_mcopy1, line
``````

says nothing about a non-contiguous array; I’ve just seen this in gcc and I suppose that is the cause.
Could it be another problem? Without acc, the program and pgdbg run well.

``````#pragma acc data copyin(ve1[0:oIe][0:os], vd3[1:oIe][0:opn], vj4[1:oAc][0:ov5[o38 - 1][1]], ls7[o32:o38][ov5[o32][0]:ov5[o38 - 1][1]][0:(ol + 1)]) create(ct8[o32:o38][0:oAc][0:oCr][0:6][1:ov5[o38 - 1][1]][0:os])
{
#pragma acc data copyout(re9[o32:o38][(oIe - oVr):oIe][ov5[o32][0]:ov5[o38 - 1][1]][0:(oCr + 3)][0:2])
``````

until now, as a beginner using basic acc features, with pgc I see advantages in the -Minfo messages at compilation, but with errors it is not as straightforward

``````int *CT, *RES;

void function(int m) {
    CT[];
}

int main(void) {
    int oC = 22;
    #pragma acc declare create(oC)

    #pragma acc data create(sir[0:oC], CT[])
    {
        for (m = 1; m < 5; m++)
        {
            function(m);
            #pragma acc update device(CT[])

            for (p = 1; p < 8; p++) {
                min0 = MAX_VAL;
                min1 = MAX_VAL;

                #pragma omp parallel for firstprivate(sir) reduction(min:min0,min1)
                for (lu = min_lu[p]; lu < max_lu[p]; lu++)
                {

                    do {                //line 1192
                        #pragma acc update device(sir[0:oC])

                        #pragma acc parallel loop gang reduction(min:min0, min1)
                        for (s = 0; s < 3; s++)
                        {
                            #pragma acc loop worker reduction(min:min0)
                            for (a = 0; a < 3; a++)
                            {
                                int x = 0;
                                int *restrict pox[] ...  // 1000 elements

                                #pragma acc loop vector reduction(+:x)
                                for (i = 1; i < 365; i++)           //line 1214
                                    for (v = 1; v < 10; v++) {
                                        use CT, sir
                                        pox[x] = ...
                                        x++;
                                        free(something);            //line 1275
                                    }

                                #pragma acc loop vector reduction(min:min0)
                                for (j = 1; j < 40; j++) {
                                    use pox[x]
                                    if (pox)                        //line 1286
                                        min0 = ...
                                }

                                free(pox);
                            }
                        }

                    } while(sir) //dependence

                }  // end "lu" omp

                RES[m][p][0] = min0;
                RES[m][p][1] = min1;
            }  // end "p"
        } // end "m"
    }
}
``````

Hi Mat,
In some posts I saw “omp+acc is possible but a bit tricky”.
I am back on this topic because at the time I didn’t know what was important to ask and what was not.
Unfortunately I can only explain; no code.

The reasons why I think “omp+acc” fits my case:

• inter-dependency in the do-while loops, so I use “omp” outside of the do-while and “acc” inside the do-while.
• the results (RES) depend only on “m” and “p”; all other loops can be executed completely in parallel
• I don’t have a big workload on the device, which means with “omp” I can increase occupancy.

Compiling with “omp” only, the code works.
But there is a difference in the -Minfo output (2 more lines) when I compile “acc+omp” compared with “acc” only.

``````1213, 1334 Accelerator restriction: size of the GPU copy of "sir" is unknown.
``````

Question_1
Why is the size of “sir” unknown when I add “omp”?
“sir” is acc “declare create” global, “firstprivate” in “omp”, and updated before acc starts; I think that is enough.

with regard to reduction(min:min0,min1)
Question_2
Is there any possibility to count how many times min0 and min1 are reached?
I could run the loops again and count them, but that means doubling the time.

Just fixing/clearing up my knowledge.
Question_3
Where should I apply *restrict to my arrays?
This question is connected to the topic below, where you added *restrict at the declaration and I don’t know why.
Do you have a hint, or can I see in the output which arrays want/need *restrict?
https://forums.developer.nvidia.com/t/contiguous-flated-multidimensional-array/135596/1
I asked that question in a separate topic; I hoped beginners like me could use the title when searching.

``````main:
1146, Generating copyin(gvd3[:(u1-v1)*(u2-v2)])
Generating copyout(gre9[:(h1-k1)*((h2-k2)*((h3-k3)*((h4-k4)*(h5-k5))))])
Generating copyin(gvj4[:(p1-b1)*(p2-b2)],gve1[:(f1-l1)*(f2-l2)])
Generating create(gct8[:(e1-a1)*((e2-a2)*((e3-a3)*((e4-a4)*(e5-a5))))])
Generating copyin(gls7[:(m1-n1)*((m2-n2)*(m3-n3))])
1192, Generating copyin(p2,oSt,oIe,ol,oVr,n3,oCr,a4,a3,a2,a1,e4,e3,e2,e1,b2,n2,n1,m3,m2,m1,l2,u1,sir[:oCr],v2,v1,u2,b1,a5,p1,os,l1,k5,f1,e5,k4,k3,k2,k1,h5,h4,h3,h2,h1,f2)
Generating update device(sir[:oCr])
Accelerator kernel generated
Generating Tesla code
1192, Generating reduction(min:mY,mX)
Vector barrier inserted due to potential dependence into a vector loop
1197, #pragma acc loop gang /* blockIdx.x */
1200, #pragma acc loop worker(4) /* threadIdx.y */
Generating reduction(min:mX)
1205, #pragma acc loop seq
1207, #pragma acc loop seq
1211, #pragma acc loop vector(32) /* threadIdx.x */
Generating reduction(+:x)
Vector barrier inserted to share data across vector lanes
1220, #pragma acc loop seq
1227, #pragma acc loop seq
1229, #pragma acc loop seq
1231, #pragma acc loop seq
1235, #pragma acc loop seq
1241, #pragma acc loop seq
1248, #pragma acc loop seq
1250, #pragma acc loop seq
1257, #pragma acc loop seq
1259, #pragma acc loop seq
1265, #pragma acc loop seq
1267, #pragma acc loop seq
1271, Generating implicit reduction(+:x)
1275, Vector barrier inserted due to potential dependence out of a vector loop
Vector barrier inserted to share data across vector lanes
1281, #pragma acc loop vector(32) /* threadIdx.x */
Generating reduction(min:mX)
1284, #pragma acc loop seq
1286, Vector barrier inserted due to potential dependence out of a vector loop
1200, Loop is parallelizable
1205, Loop is parallelizable
1207, Loop is parallelizable
1211, Loop is parallelizable
1220, Loop is parallelizable
1227, Loop is parallelizable
1229, Loop is parallelizable
1231, Loop is parallelizable
1235, Loop is parallelizable
1241, Loop is parallelizable
1248, Loop is parallelizable
1250, Loop is parallelizable
1257, Loop is parallelizable
1259, Loop is parallelizable
1265, Loop is parallelizable
1267, Loop is parallelizable
1281, Loop is parallelizable
1284, Loop is parallelizable
``````

Hi tasica,

Question_1
Why is the size of “sir” unknown when I add “omp”?
“sir” is acc “declare create” global, “firstprivate” in “omp”, and updated before acc starts; I think that is enough.

“sir” is an OpenMP firstprivate array, meaning that each OpenMP thread will have its own private copy of the array. Hence the “sir” being used in the OpenMP region is different from the global “sir”. Try moving the OpenACC data directive into the OpenMP for loop so that each thread’s “sir” gets created and copied over to the device.

Question_2
Is there any possibility to count how many times min0 and min1 are reached?
I could run the loops again and count them, but that means doubling the time.

Not within the min reduction itself, but you could probably add a second “sum” reduction to count the number of times it enters the if statement. I haven’t tried this myself, though I’m thinking you might need a second if statement for the sum, since the idiom recognition is looking for an "if minim … min = " structure.

Try:

``````#pragma acc loop vector reduction(min:min0) reduction(+:mincnt)
for (j = 1; j < 40; j++)
    if minim  {
        mincnt += 1;
        min0 = ....
    }
``````

or if that doesn’t work:

``````#pragma acc loop vector reduction(min:min0) reduction(+:mincnt)
for (j = 1; j < 40; j++) {
    if minim
        mincnt += 1;
    if minim
        min0 = ...
}
``````

Question_3
Where should I apply *restrict to my arrays?

In the declaration of the pointers.

``````int * restrict CT, * restrict RES;
``````

In C, different pointers of the same type are allowed to point at the same object (i.e. to alias) and therefore may cause a dependency. The C99 restrict keyword asserts to the compiler that the pointer does not alias another object.

See: http://en.cppreference.com/w/c/language/restrict for details about “restrict”

Hope this helps,
Mat

Hi Mat,

Yes, it was very helpful, thank you for the suggestions!!
My program has started to work.

I have updated “the logical sketch” and the “output” in the post above and will refer to them.
Only 3 lines remain in the output which I don’t know how to get past.

``````lines 1211 1275 1286 Vector barrier inserted
``````

Do not worry about line 1192; there the message comes from the do-while loop, which has a dependence - no problem.
In my opinion all 3 errors come from the array “pox”.
“pox” has a dimension of (1000) because I don’t know at the beginning how much to allocate.
“x” keeps this information - how many elements “pox” has.
I’ve done a reduction for “x”, but I don’t know what to do with “pox”.
The 2 errors at line 1275 point at a “free(something)” for another array, but I think they belong to “pox”; it is like the errors are repeated at the end of the loops.
Question_1
What should I write for “pox”, since for that vector loop I get “reduction type not supported” and “shared is invalid text in pragma”?

Until now I helped myself with the error messages from gcc because they are a straightforward comparison with pgc. Since I added *restrict to the arrays on the device, gcc flags every *restrict as an error. I am a bit confused.
Question_2
Why does this happen? It should be recognized by gcc too. Not so important, this question, just as information.

If the situation is not easy to understand from my explanation, let me know and I will try to make it clear.

Below are the run-time messages; not much help to me now, but maybe they are helpful to you.

``````$ ./a.out
p = 6 i = 0 p = 6 i = 1 p = 6 i = 2
Accelerator Kernel Timing data
/home/a/Documents/3.cpp
main  NVIDIA  devicenum=0
time(us): 14,596
1146: data region reached 1 time
1146: data copyin transfers: 4
device time(us): total=49 max=26 min=5 avg=12
1192: data region reached 10 times
1192: data copyin transfers: 135
device time(us): total=484 max=9 min=3 avg=3
1192: update directive reached 874 times
1192: data copyin transfers: 874
device time(us): total=3,556 max=21 min=3 avg=4
1192: compute region reached 874 times
1192: data copyin transfers: 874
device time(us): total=2,651 max=13 min=3 avg=3
1192: kernel launched 874 times
grid: [3]  block: [32x4]
elapsed time(us): total=1,781,626 max=2,397 min=1,841 avg=2,038
1192: reduction kernel launched 873 times
grid: [2]  block: [256]
elapsed time(us): total=12,574 max=32 min=13 avg=14
1192: data copyout transfers: 873
device time(us): total=7,856 max=26 min=8 avg=8
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
``````

Hi tasica,

Question_1
What should I write for “pox”, since for that vector loop I get “reduction type not supported”?

Let’s review what you have posted for “pox”, since there are a number of issues here:

``````#pragma acc loop worker reduction(min:min0)
for (a = 0; a < 3; a++)
{
    int x = 0;
    int *restrict pox ...  // 1000 members

    #pragma acc loop vector reduction(+:x)
    for (i = 1; i < 365; i++)           //line 1214
        for (v = 1; v < 10; v++) {
            use CT, sir
            pox[x] = ...
            x++;
            free();                     //line 1275
        }

    #pragma acc loop vector reduction(min:min0)
    for (j = 1; j < 40; j++) {
        use pox[x]
        if (pox)                        //line 1286
            min0 = ...
    }

    free(pox);
}
``````

First, let me explain how reductions work. By putting the “x++” in the loop you’ve created a dependency, since each iteration depends on the value x had in the previous iteration before it can be updated. With a reduction, however, the loop can be parallelized by having each thread make a private copy of x, perform a partial reduction on its copy, and then, after the loop, perform a final reduction over the partial results.

But here you’re using the value of x as the index into pox. If you use a reduction, you’d be using the private x, not the shared one, and hence would be updating the wrong element of pox.

If the order in which the values of pox are added doesn’t matter, you could try using an atomic capture where the shared variable is updated atomically. Something like:

``````int xx;
#pragma acc atomic capture
xx = x++;
pox[xx] = ...
``````

Atomic operations can be slow, however, so you may lose some performance.

Question_1
and “shared is invalid text in pragma” ?

I don’t see where you have a “shared” clause in your example, but “shared” isn’t an OpenACC clause. It is an OpenMP clause, so perhaps you’re mixing the two?

Until now I helped myself with the error messages from gcc because they are a straightforward comparison with pgc. Since I added *restrict to the arrays on the device, gcc flags every *restrict as an error. I am a bit confused.
Question_2
Why does this happen? It should be recognized by gcc too.

“restrict” is part of C99. Until recently, gcc defaulted to C89. Try setting “-std=c99” to get gcc to recognize restrict.

From your run output, it looks like your code is getting an illegal address error and crashing:

call to cuMemFreeHost returned error 700: Illegal address during

Without the full code, I can’t know for sure where or why this occurs. You can try setting the environment variable “PGI_ACC_NOTIFY=1” to have the runtime show you all the kernel launches; you can then see the last one that executed. There’s also “PGI_ACC_DEBUG=1”, which gives very detailed information about all OpenACC runtime calls being made.

It does look like you are allocating memory from within device code. Keep in mind that there’s a very small heap (~8MB) on the device, which can be consumed very quickly. If you run out of heap space, the program can crash with this error. I don’t know for sure if that is what’s happening here, but you can try increasing the heap size via the environment variable “PGI_ACC_CUDA_HEAPSIZE=”.

Another possible issue: you note that pox has 1000 elements, but it looks like x gets updated on every iteration of the i and v loops, which is 364 × 9 = 3,276 iterations. This would cause you to write past the end of the pox array and thus cause an illegal address error.

Also, why is there an empty “free()” statement at line 1275?

Note that without a full reproducing example, many of my answers are just educated guesses. If you are still stuck, posting the full code, or sending a copy to PGI Customer Service (trs@pgroup.com) and asking them to forward it to me, would be helpful.

Though before doing this, be sure your program is correct on the host (i.e. when not using OpenACC). For example, if you are accessing pox outside its bounds, this error would also appear in the host version. It doesn’t always cause the program to crash (it depends on what other memory you’re writing over), so utilities such as Valgrind are very useful for debugging the code.

-Mat

Hi Mat,

Maybe my experience helps improve pgc.
For various reasons you are my only source for learning OpenACC; after every discussion with you I’ve made good steps in learning “acc”.
Few remarks:

• OpenACC is developing fast; many things I found on Google/YouTube (the traditional way of learning) were obsolete or deprecated (e.g. #pragma acc region …)
• I thought “let’s follow the latest version of the PDF guides”, so I started to read the latest guides from openacc.org and pgroup.com. The results were better than the first way, but examples for directives and clauses are missing or poor.
• there are also differences between what OpenACC 2.5/2.6 requires and what is or isn’t already implemented in pgc. I also saw users on the forum asking you about some clauses, with your answer “it is work in progress”; I am not alone in using the wrong things.
• “let’s read the posts in the users forum”: the results were better than the second way, but still not good. The reason is that it takes long to find exactly what I’m interested in, because sometimes the titles are not suggestive, or a thread with 100 posts drifts away from where it started.
• the pgc compiler is not perfect at explaining errors, and an example can help avoid them.

So, I am missing an up-to-date overview, like an Excel table or web page where I can find:

``````directive            clause     introduced          deprecated      example C code source
-----------------------------------------------------------------------------------------
#pragma acc region              version 10.0        version 14.0    obsolete - no example
#pragma acc atomic   write      version 10.0                        link to example
                     capture    version 12.0                        same link, same example
                     link       work in progress                    no example yet
``````

“link to the example”: one example can cover a few clauses (as commented lines).
This would help me avoid obsolete directives and show how to use a specific directive+clause without errors.
The “Reference Guide” has a bit of a table, but no examples, and it is a book.
A table is much more structured than a book, and engineers prefer graphs/tables to reading pages.
If there is already a clear/simple way to learn acc which I don’t know about, please excuse me for writing this story and point me to it.
May I use email for comments on 2.cpp? Or should I ask here?

Hi Mat,

I have tried to implement “#pragma acc atomic capture” following the suggestions from the forum.
I received “nvvmCompileProgram error: 9” and gave up.
Now I’ve installed all new software (CUDA 9.1 with patches + driver version 390.30) because I saw some posts about nvvm mentioning “old drivers”.
But I get the same error compiling 2.cpp.
If it works on your computer, maybe my compiler is still missing something.

Q_1
I don’t know how to get past this error:
“Error: 2.cpp(292) Error: unsupported operation”
It points at the “xx = x++;”

``````int xx;
#pragma acc atomic capture
xx = x++;
``````
``````

Below is the output from 2.cpp; the link shows the compiler configuration.

``````main:
191, Generating copyin(ls7[:(m1-n1)*(m2-n2)],vj4[:(p1-b1)*(p2-b2)],ve1[:(f1-l1)*(f2-l2)],ct8[:(e1-a1)*((e2-a2)*((e3-a3)*((e4-a4)*(e5-a5))))])
205, Generating enter data create(sir[:22])
224, Generating update device(sir[:22])
Generating present(sir[:])
Accelerator kernel generated
Generating Tesla code
224, Generating reduction(min:mX)
227, #pragma acc loop gang collapse(2) /* blockIdx.x */
230,   /* blockIdx.x collapsed */
234, #pragma acc loop vector(128) collapse(2) /* threadIdx.x */
239, #pragma acc loop vector(128) /* threadIdx.x */
243, #pragma acc loop seq
248, #pragma acc loop seq
249, #pragma acc loop seq
250, #pragma acc loop seq
256, #pragma acc loop seq
261, #pragma acc loop seq
268, #pragma acc loop seq
272, #pragma acc loop seq
273, #pragma acc loop seq
280, #pragma acc loop seq
282, #pragma acc loop seq
287, #pragma acc loop seq
288, #pragma acc loop seq
301, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(min:mX)
234, Loop is parallelizable
235, Loop is parallelizable
239, Loop is parallelizable
243, Loop is parallelizable
248, Loop carried reuse of mx-> prevents parallelization
249, Loop carried dependence of mx-> prevents parallelization
Loop carried backward dependence of mx-> prevents vectorization
250, Loop is parallelizable
256, Accelerator restriction: size of the GPU copy of tm,mx is unknown
Loop is parallelizable
261, Accelerator restriction: size of the GPU copy of tm is unknown
Loop is parallelizable
268, Accelerator restriction: size of the GPU copy of ti is unknown
Loop is parallelizable
272, Loop carried reuse of ti-> prevents parallelization
Loop carried scalar dependence for k at line 276,275
273, Accelerator restriction: size of the GPU copy of tm,ti is unknown
Loop carried reuse of ti-> prevents parallelization
Loop carried scalar dependence for k at line 276,275
280, Accelerator restriction: size of the GPU copy of ix is unknown
Loop is parallelizable
282, Accelerator restriction: size of the GPU copy of ti,ix is unknown
Loop is parallelizable
287, Loop carried reuse of pox-> prevents parallelization
288, Accelerator restriction: size of the GPU copy of pox is unknown
Complex loop carried dependence of pox-> prevents parallelization
Loop carried reuse of pox-> prevents parallelization
301, Loop is parallelizable
321, Generating exit data delete(sir[:1])
nvvmCompileProgram error: 9.
Error: 2.cpp(292) Error: unsupported operation
PGCC-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (2.cpp: 1)
PGCC/x86 Linux 17.10-0: compilation aborted
``````

Hi tasica,

Q_1
I don’t know how to get past this error
“Error: 2.cpp(292) Error: unsupported operation”
It points at the “xx = x++;”

I don’t see this error in the version of the code I sent you. Did you modify where “x” is declared or make other changes?

Only global variables or gang-private (i.e. CUDA shared) variables can be used as atomic variables. If “x” is a local variable, you can get this error.

-Mat

Hi Mat,

(as usual) I had a few mistakes combined which gave that message (wrong installation + compile options + error 999).
Now I have cleaned them up, and I would like you to have a look at the code.
Because there are 3 txt files of input data, may I use email to send the code?

Sure, though I’m a bit swamped with some major projects, so I may not have much time to spend on it. I’ll do my best.

-Mat