Parallelize some loops with OMP + ACC

Hello,
My program has a few for loops which can be parallelized.
This is approximately how the program works, including the bounds of the for loops.

void function(int m)
{
CT
}

int main(void)
{
	for (m = 1; m < 5; m++)
	{
		function(m);
		for (p = 1; p < 8; p++)
			for (lu = 1; lu < 251; lu++)
			{
				for (s = 0; s < 3; s++)
					for (a = 0; a < 3; a++)
					{
						for (i = 1; i < 365; i++)
							for (v = 1; v < 10; v++)
								use CT
						for (j = 1; j < 40; j++)
							if minim
								RES[m][p][0] = ...
					}
				for (t = 0; t < 3; t++)
					for (c = 0; c < 85; c++)
					{
						for (i = 1; i < 365; i++)
							for (v = 1; v < 10; v++)
								use CT
						for (j = 1; j < 40; j++)
							if minim
								RES[m][p][1] = ...
					}
			}
	}
}

I have a combined question, software + hardware.
As you can see, the final array of results depends only on m and p from the first and second for loops; all the other loops must be executed completely before a value is written to RES.
I have an AMD CPU with 8 physical cores and a GTX 690 (2 × 1536 CUDA cores).
I need your help parallelizing this program with OMP + ACC so that the hardware resources are used well.
My problems are:

  • I don’t know what I have to write in each #pragma so that RES depends only on m and p while all the other loops are executed completely
  • maybe I can add an OMP NOWAIT to the “lu” for loop and delete OMP from the “p” for loop; this should avoid writing the dependency on “p” and an OMP atomic at RES
  • where should I activate the second GPU? the loops are a bit unbalanced, because “c” goes up to 85 while “a” only goes up to 3

Below is my attempt as a beginner; any suggestion is welcome.

void function(int m)
{
CT
}

int main(void)
{
	for (m = 1; m < 5; m++)
	{
		function(m);
		#pragma acc data copyin(CT) copyout(RES)
		{
			#pragma omp parallel for schedule(dynamic) shared(??) default(none)
			for (p = 1; p < 8; p++)
				for (lu = 1; lu < 251; lu++)
					#pragma acc region
					{
						#pragma acc loop independent vector(16)
						for (s = 0; s < 3; s++)
							#pragma acc loop independent vector(16)
							for (a = 0; a < 3; a++)
							{
								#pragma acc loop independent vector(16)
								for (i = 1; i < 365; i++)
									for (v = 1; v < 10; v++)
										use CT
								for (j = 1; j < 40; j++)
									#pragma acc atomic
									if minim
										RES[m][p][0] = ...
							}
						#pragma acc loop independent vector(16)
						for (t = 0; t < 3; t++)
							#pragma acc loop independent vector(16)
							for (c = 0; c < 85; c++)
							{
								#pragma acc loop independent vector(16)
								for (i = 1; i < 365; i++)
									for (v = 1; v < 10; v++)
										use CT
								for (j = 1; j < 40; j++)
									#pragma acc atomic
									if minim
										RES[m][p][1] = ...
							}
					}
		}
	}
}

Hi tasica,

There’s no need for OpenMP in this case, since the code doesn’t do any computation on the host and you only have one GPU. I’d suggest doing something like the following. Note that “region” is from the old PGI Accelerator model and has been deprecated; you should use OpenACC’s “parallel” or “kernels” compute constructs instead. Also, OpenACC does not allow nested “vector” loops; you should “collapse” these loops instead. Finally, I would use a min reduction for the results rather than atomics. It means adding a few scalars, but you won’t need to copy “RES” to the device.


void function(int m)
{
   CT
}

int main(void)
{
   ....

   // Move the device creation of CT out of the "m" loop so that it
   // only gets created once.
   #pragma acc data create(CT)
   {
      for (m = 1; m < 5; m++)
      {
         function(m);
         // Update CT's device value
         #pragma acc update device(CT)

         for (p = 1; p < 8; p++)
         {
            // Reset the partial results for this (m, p) pair
            min0 = MAX_VAL;
            min1 = MAX_VAL;

            #pragma acc parallel loop gang reduction(min:min0,min1)
            for (lu = 1; lu < 251; lu++)
            {
               #pragma acc loop worker collapse(2) reduction(min:min0)
               for (s = 0; s < 3; s++)
                  for (a = 0; a < 3; a++)
                  {
                     #pragma acc loop vector
                     for (i = 1; i < 365; i++)
                        for (v = 1; v < 10; v++)
                           use CT
                     #pragma acc loop vector reduction(min:min0)
                     for (j = 1; j < 40; j++)
                        if minim
                           min0 = ...
                  }

               #pragma acc loop worker collapse(2) reduction(min:min1)
               for (t = 0; t < 3; t++)
                  for (c = 0; c < 85; c++)
                  {
                     #pragma acc loop vector
                     for (i = 1; i < 365; i++)
                        for (v = 1; v < 10; v++)
                           use CT
                     #pragma acc loop vector reduction(min:min1)
                     for (j = 1; j < 40; j++)
                        if minim
                           min1 = ...
                  }
            }  // end "lu" loop

            // RES depends on m and p, so write it inside the "p" loop
            // once the reductions have completed.
            RES[m][p][0] = min0;
            RES[m][p][1] = min1;
         }  // end "p" loop
      }  // end "m" loop
   }  // end data region
}

I really appreciate your suggestions; I will try to implement them in the next two days.
I hope it will work smoothly.
Thank you very much :-)

PROBLEM SOLVED - BY COMMENTING OUT THE PRAGMA DIRECTIVES ONE BY ONE I FOUND THE MISTAKE IN WRITING THE RESULTS TO RES; ALL OTHER PRAGMAS WERE CORRECT
SUBJECT CLOSED

update:
I’ve got some strange messages.
Searching similar topics, I found problems with -g or not enough space in /tmp …
This is why you find 3 different collections of options at compilation below.
Where are they coming from?
I’m using a qsort function and a do-while inside the parallel region; maybe qsort is the reason.

So as not to get confused by discrepancies between the GTX 690 and the option tesla:cc50, I must mention that I work on a new laptop and run the program for results on the GTX 690.

1__________________
$ pgc++ 2.cpp -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mprof=ccff -Mipa=libc

Internal error: assertion failed: find_assoc_pragma: pragma not found
          (../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 8388608 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 199 10 -x 39 0x80 -x 59 4 -tp haswell -x 120 0x1000 -astype 0 -x 121 1 -fn 2.cpp -il /tmp/pgc++LsDV9T56LEk.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -autoinl 10 -x 168 400 -x 174 128000 -x 14 0x200000 -x 46 4 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -accel multicore -x 180 0x4000400 -x 121 0xc00 -x 163 0x1 -x 186 0x80000 -cudaver 7.5 -x 194 0x40000 -x 176 0x100 -cudacap 50 -x 186 0x80000 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 180 0x4000400 -x 121 0xc00 -x 163 1 -x 186 0x80000 -x 180 0x400 -x 121 0xc00 -x 210 8 -x 176 0x100 -cudacap 50 -x 189 0x8000 -y 163 0xc0000000 -x 189 0x10 -y 189 0x4000000 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -concur 512 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 2 0x100000 -x 89 0x80800842 -x 66 0x20000 -x 14 0x104 -y 89 0x40 -gnuvsn 60401 -x 69 0x200 -cmdline '+pgc++ /tmp/pgc++LsDV9T56LEk.il -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mprof=ccff -Mipa=libc' -ipacomp pgc++ -ipacompsw '+$n -Minfo=accel -Mconcur -ta=tesla:cc50 -acc -ta=multicore -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mprof=ccff -Mipa=libc -c --ZIPAimport $f $b $a -o $o' -ipaexport /tmp/pgc++TsDhPtAKHnI.ipx -exfile /tmp/pgc++vsD-XqxbpfN.ipn -exifile /tmp/pgc++9sD3b3pEFYp.ipm -ipofile /tmp/pgc++fsDpVflfxMx.ipo -ccff /tmp/pgc++DsDxmeOO-B-.ccff -asm /tmp/pgc++nsDNR9TPzob.s


2__________________
$ pgc++ -g 2.cpp -lgsl -lgslcblas -lm -mp -acc

Internal error: assertion failed: find_assoc_pragma: pragma not found
          (../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -debug -x 120 0x200 -opt 0 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -tp haswell -astype 0 -fn 2.cpp -il /tmp/pgc++gEDsPFGvB60.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -accel host -x 186 0x80000 -x 180 0x4000400 -x 163 0x1 -cudaver 7.5 -x 176 0x100 -cudacap 30 -cudacap 35 -cudacap 50 -x 121 0xc00 -x 194 0x40000 -x 189 0x8000 -y 163 0xc0000000 -x 163 0x800000 -x 189 0x10 -y 189 0x4000000 -mp -x 69 0x200 -x 69 0x400 -gnuvsn 60401 -x 69 0x200 -x 119 0x08 -cmdline '+pgc++ /tmp/pgc++gEDsPFGvB60.il -g -lgsl -lgslcblas -lm -mp -acc' -asm /tmp/pgc++2EDIZCFS1Yx.s


3________________________
$ pgc++ -fast -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm 2.cpp

Internal error: assertion failed: find_assoc_pragma: pragma not found
          (../src/il2.c, line 22231)
pgc++-Fatal-/opt/pgi/linux86-64/17.10/bin/pggpp2 TERMINATED by signal 6
Arguments to /opt/pgi/linux86-64/17.10/bin/pggpp2
/opt/pgi/linux86-64/17.10/bin/pggpp2 2.cpp -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 8388608 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 199 10 -x 39 0x80 -x 59 4 -tp haswell -x 120 0x1000 -astype 0 -x 121 1 -fn 2.cpp -il /tmp/pgc++zPDlBCFMVFt.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=60401 -autoinl 10 -x 168 400 -x 174 128000 -x 14 0x200000 -x 46 4 -x 120 0x200000 -x 70 0x40000000 -x 164 0x800000 -accel tesla -x 180 0x4000400 -x 121 0xc00 -x 163 0x1 -x 186 0x80000 -cudaver 7.5 -x 194 0x40000 -x 176 0x100 -cudacap 50 -x 186 0x80000 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 189 0x8000 -y 163 0xc0000000 -x 189 0x10 -y 189 0x4000000 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 89 0x80800842 -x 66 0x20000 -x 14 0x104 -y 89 0x40 -mp -x 69 0x200 -x 69 0x400 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -gnuvsn 60401 -x 69 0x200 -cmdline '+pgc++ /tmp/pgc++zPDlBCFMVFt.il -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm' -ipacomp pgc++ -ipacompsw '+$n -fast -Mvect=sse -Mcache_align -Mflushz -Mpre -Mipa=libc -ta=tesla:cc50 -acc -mp -Minfo=accel -lstdc++ -lgsl -lgslcblas -lm -c --ZIPAimport $f $b $a -o $o' -ipaexport /tmp/pgc++5PDRI9samYH.ipx -exfile /tmp/pgc++bPDd4jRfmxa.ipn -exifile /tmp/pgc++jPDB44iAnaJ.ipm -ipofile /tmp/pgc++XPDtGrube_d.ipo -asm /tmp/pgc++HPDJ0Fld5vP.s

Hi tasica,

If you can, please send a reproducing example of the file that caused the error to PGI Customer Service (trs@pgroup.com).

While the cause was a user error with a misspelled pragma, the compiler should have caught this or at least not fail with an internal error. I’d like to have our engineers look at this.

Thanks,
Mat

Hi Mat,

In the next days I will make a short version of my program.
Until then: ATOMIC was the #pragma which I commented out, and then the program gave me back errors in the traditional style.
The solution was to follow your suggestion with the scalar, so I avoid using ATOMIC.
In the beginning I didn’t want to use the scalar because it required writing more lines. Anyway, it looks like I must learn to use ATOMIC.
Also, gcc was complaining about the way I’d written the ATOMIC #pragma, something about the {} braces:

double med, *****re9;
int *sir;

// if the actual value (med) is smaller than the one already stored at position oCr,
// then store the whole new environment: *sir, med, r1, r2

#pragma acc atomic
{
	if (med < re9[mm][j][p][oCr][t])
	{
		for (int c = 0; c < oCr; c++)
			re9[mm][j][p][c][t] = (double) (sir[c]);
		re9[mm][j][p][oCr][t] = med;
		re9[mm][j][p][oCr + 1][t] = (double)r1;
		re9[mm][j][p][oCr + 2][t] = (double)r2;
	}
}

Hi tasica,

Atomics can only be applied to a specific storage location and are only needed when you have collisions on that location. A collision is when multiple threads access the same location and thus could produce a race condition.

So here you’re trying to make a whole region atomic instead of a specific memory location. This is more like an OpenMP critical section. Critical sections aren’t supported in OpenACC, since OpenACC doesn’t presume that there is support for the global synchronization between threads which is required for critical sections to work.

Without the full code it’s difficult to give you recommendations here, but my assumption is that you probably don’t need atomics at all. Most likely you’re parallelizing across the first, second, and/or third dimension of “re9”, so there isn’t a collision on the fourth and fifth dimensions. Though for illustration, let’s assume that there is a collision, in which case you’d do something like the following:

    
#pragma acc atomic read
tmp = re9[mm][j][p][oCr][t];

if (med < tmp)
{
   for (int c = 0; c < oCr; c++)
      re9[mm][j][p][c][t] = (double) (sir[c]);
#pragma acc atomic write
   re9[mm][j][p][oCr][t] = med;
#pragma acc atomic write
   re9[mm][j][p][oCr + 1][t] = (double)r1;
#pragma acc atomic write
   re9[mm][j][p][oCr + 2][t] = (double)r2;
}

Note that since “atomic” can slow your code down, it’s recommended to refactor algorithms to remove any collisions, and to use atomics only if that is not possible.
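
For illustration, a sketch of that refactor in the spirit of the earlier post: a min reduction removes the collision entirely, so no atomic is needed. The names n, val, and jj here are hypothetical placeholders, not from your code.

double best = MAX_VAL;
#pragma acc parallel loop reduction(min:best)
for (int j = 0; j < n; j++)
   if (val[j] < best)
      best = val[j];
// a single write after the loop - no collision on re9
re9[mm][jj][p][oCr][t] = best;   // jj: whatever index applies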

-Mat

Mat,
thank you very much for your clarification.
Now I know where the “strange” message was coming from; I used atomic wrongly.
I hope it is no longer necessary to send the code.
I want to help improve the compiler, but as a beginner it takes me a while to write it up, and I also don’t have much free time.
gcc was showing the error; pgc++ was terminated by signal 6.
Let me know if you need the code and I will find time to write it.

Second problem:
I’ve noticed pgdbg gives a SIGSEGV a few lines (2, sometimes 4) before a “#pragma acc data copy” with a non-contiguous array, not exactly at the #pragma line.
The message

 SIGSEGV at 0x7FF9AB906D72, function __c_mcopy1, line

says nothing about a non-contiguous array; I’ve just seen it in gcc and I suppose this is it.
Could it be another problem? Without acc, the program and pgdbg run fine.

#pragma acc data copyin(ve1[0:oIe][0:os], vd3[1:oIe][0:opn], vj4[1:oAc][0:ov5[o38 - 1][1]], ls7[o32:o38][ov5[o32][0]:ov5[o38 - 1][1]][0:(ol + 1)]) create(ct8[o32:o38][0:oAc][0:oCr][0:6][1:ov5[o38 - 1][1]][0:os])
    {
        #pragma acc data copyout(re9[o32:o38][(oIe - oVr):oIe][ov5[o32][0]:ov5[o38 - 1][1]][0:(oCr + 3)][0:2])

Until now, as a beginner using basic acc features, with pgc++ I see advantages in the messages coming at compilation with -Minfo, but with errors it is not as straightforward.

int *CT, *RES;

void function(int m) {
  CT[];
}

int main(void) {
  int oC = 22;
  #pragma acc declare create(oC)

  #pragma acc data create(sir[0:oC], CT[])
  {
    for (m = 1; m < 5; m++)
    {
      function(m);         
      #pragma acc update device(CT[])

      for (p = 1; p < 8; p++) {
        min0 = MAX_VAL;
        min1 = MAX_VAL;

        #pragma omp parallel for firstprivate(sir) reduction(min:min0,min1)
        for (lu = min_lu[p]; lu < max_lu[p]; lu++)
        {

          do {                //line 1192
              #pragma acc update device(sir[0:oC])

                #pragma acc parallel loop gang reduction(min:min0, min1)
                for (s = 0; s < 3; s++)
                {
                    #pragma acc loop worker reduction(min:min0)
                    for (a = 0; a < 3; a++)
                    {
                        int x = 0;
                        int *restrict pox[] ...  // 1000 elements

                        #pragma acc loop vector reduction(+:x)
                        for (i = 1; i < 365; i++)           //line 1214
                          for (v = 1; v < 10; v++) {
                            use CT, sir
                            pox[x] = ...
                            x++;
                            free(something);                         //line 1275
                          }

                       #pragma acc loop vector reduction(min:min0)
                        for (j = 1; j < 40; j++) {
                          use pox[x]
                          if (pox)                          //line 1286
                            min0 = ...
                        }

                      free(pox);
                }
            }

          } while(sir) //dependence

        }  // end "lu" omp

        RES[m][p][0] = min0;
        RES[m][p][1] = min1;
      }  // end "p"
    } // end "m"
  }
}

Hi Mat,
In some posts I saw “omp+acc is possible but a bit tricky”.
I am back on this topic because at that time I didn’t know what was important to ask and what not.
Unfortunately I can only explain, no code.

The reasons why I think “omp+acc” fits my case:

  • there is inter-dependency in the do-while loops, so I use “omp” outside of the do-while and “acc” inside the do-while.
  • the results (RES) depend only on “m” and “p”; all other loops can be executed completely in parallel
  • I don’t have a big workload on the device, meaning with “omp” I can increase occupancy.

Compiling with “omp” only, the code works.
But there is a difference in the -Minfo output (2 more lines) when I compile “acc+omp” compared with “acc” only.

1213, 1334 Accelerator restriction: size of the GPU copy of "sir" is unknown.

Question_1
Why is the size of “sir” unknown when I add “omp”?
“sir” is acc “declare create” global, “firstprivate” in “omp”, and updated before acc starts; I think that should be enough.

With regard to reduction(min:min0,min1):
Question_2
Is there any possibility to count how many times min0 and min1 are reached?
I can run the loops again and count them, but that means doubling the time.

Just fixing/clearing up my knowledge:
Question_3
Where should I apply *restrict to my arrays?
This question is in connection with the topic below, where you added *restrict at the declaration and I don’t know why.
Do you have a hint, or can I see in the output which array wants/needs *restrict?
https://forums.developer.nvidia.com/t/contiguous-flated-multidimensional-array/135596/1
I asked that question in a separate topic; I hoped beginners like me could use the title when searching.

main:
   1146, Generating copyin(gvd3[:(u1-v1)*(u2-v2)])
         Generating copyout(gre9[:(h1-k1)*((h2-k2)*((h3-k3)*((h4-k4)*(h5-k5))))])
         Generating copyin(gvj4[:(p1-b1)*(p2-b2)],gve1[:(f1-l1)*(f2-l2)])
         Generating create(gct8[:(e1-a1)*((e2-a2)*((e3-a3)*((e4-a4)*(e5-a5))))])
         Generating copyin(gls7[:(m1-n1)*((m2-n2)*(m3-n3))])
   1192, Generating copyin(p2,oSt,oIe,ol,oVr,n3,oCr,a4,a3,a2,a1,e4,e3,e2,e1,b2,n2,n1,m3,m2,m1,l2,u1,sir[:oCr],v2,v1,u2,b1,a5,p1,os,l1,k5,f1,e5,k4,k3,k2,k1,h5,h4,h3,h2,h1,f2)
         Generating update device(sir[:oCr])
         Accelerator kernel generated
         Generating Tesla code
       1192, Generating reduction(min:mY,mX)
             Vector barrier inserted due to potential dependence into a vector loop
       1197, #pragma acc loop gang /* blockIdx.x */
       1200, #pragma acc loop worker(4) /* threadIdx.y */
             Generating reduction(min:mX)
       1205, #pragma acc loop seq
       1207, #pragma acc loop seq
       1211, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(+:x)
             Vector barrier inserted to share data across vector lanes
       1220, #pragma acc loop seq
       1227, #pragma acc loop seq
       1229, #pragma acc loop seq
       1231, #pragma acc loop seq
       1235, #pragma acc loop seq
       1241, #pragma acc loop seq
       1248, #pragma acc loop seq
       1250, #pragma acc loop seq
       1257, #pragma acc loop seq
       1259, #pragma acc loop seq
       1265, #pragma acc loop seq
       1267, #pragma acc loop seq
       1271, Generating implicit reduction(+:x)
       1275, Vector barrier inserted due to potential dependence out of a vector loop
             Vector barrier inserted to share data across vector lanes
       1281, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(min:mX)
       1284, #pragma acc loop seq
       1286, Vector barrier inserted due to potential dependence out of a vector loop
   1200, Loop is parallelizable
   1205, Loop is parallelizable
   1207, Loop is parallelizable
   1211, Loop is parallelizable
   1220, Loop is parallelizable
   1227, Loop is parallelizable
   1229, Loop is parallelizable
   1231, Loop is parallelizable
   1235, Loop is parallelizable
   1241, Loop is parallelizable
   1248, Loop is parallelizable
   1250, Loop is parallelizable
   1257, Loop is parallelizable
   1259, Loop is parallelizable
   1265, Loop is parallelizable
   1267, Loop is parallelizable
   1281, Loop is parallelizable
   1284, Loop is parallelizable

Hi tasica,

Question_1
Why is the size of “sir” unknown when I add “omp”?
“sir” is acc “declare create” global, “firstprivate” in “omp”, and updated before acc starts; I think that should be enough.

“sir” is an OpenMP firstprivate array, meaning that each OpenMP thread will have its own private copy of the array. Hence the “sir” being used in the OpenMP region is different from the global “sir”. Try moving the OpenACC data directive into the OpenMP for loop so that each thread’s “sir” gets created and copied over to the device.
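
For illustration, a minimal sketch in the style of your pseudocode (assuming “sir” has oC elements, as in your sketch above):

#pragma omp parallel for firstprivate(sir) reduction(min:min0,min1)
for (lu = min_lu[p]; lu < max_lu[p]; lu++)
{
   // each thread creates and copies its own private "sir",
   // so the OpenACC runtime knows its size on the device
   #pragma acc data copyin(sir[0:oC])
   {
      ... // acc compute regions that use sir
   }
}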

Question_2
Is there any possibility to count how many times min0 and min1 are reached?
I can run the loops again and count them, but that means doubling the time.

Not within the min reduction itself, but you probably could add a second “sum” reduction to count the number of times it enters the if statement. I haven’t tried this myself, though I’m thinking that you might need a second if statement for the sum, since the idiom recognition is looking for an “if minim … min = …” structure.

Try:

#pragma acc loop vector reduction(min:min0) reduction(+:mincnt)
for (j = 1; j < 40; j++)
   if minim {
      mincnt += 1;
      min0 = ....
   }

or if that doesn’t work:

#pragma acc loop vector reduction(min:min0) reduction(+:mincnt)
for (j = 1; j < 40; j++) {
   if minim
      mincnt += 1;
   if minim
      min0 = ...
}



Question_3
Where should I apply *restrict to my arrays?

In the declaration of the pointers.

int * restrict CT, * restrict RES;

In C, different pointers of the same type are allowed to point at the same object (i.e. to alias) and may therefore cause a dependency. The C99 restrict keyword asserts to the compiler that the pointer does not alias another object.

See: http://en.cppreference.com/w/c/language/restrict for details about “restrict”
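
For illustration, a minimal sketch (the copy function here is hypothetical): without restrict the compiler must assume dst and src may alias, which blocks parallelization of even a simple copy loop.

void copy(int n, double * restrict dst, const double * restrict src)
{
   #pragma acc parallel loop
   for (int i = 0; i < n; i++)
      dst[i] = src[i];   // safe: dst and src are asserted not to alias
}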

Hope this helps,
Mat

Hi Mat,

Yes, it was very helpful, thank you for the suggestions !!
My program started to work.

I have updated “the logical sketch” and the “output” in the post above; I will refer to them.
Only 3 lines remain in the output which I don’t know how to get past:

lines 1211 1275 1286 Vector barrier inserted

Do not worry about line 1192; there the message comes from the do-while loop, which has a dependence - no problem.
In my opinion all 3 messages come from the array “pox”.
“pox” has a dimension of 1000 because I don’t know at the beginning how much to allocate.
“x” keeps this information - how many elements “pox” has.
I’ve done a reduction for “x”, but I don’t know what to do with “pox”.
The 2 messages at line 1275 point at a “free(something)”, which is a different array, but I think they belong to “pox”; it is like the errors are repeated at the end of the loops.
Question_1
What should I write for “pox”, since for that vector loop I get “reduction type not supported” and “shared is invalid text in pragma”?

Until now I helped myself with error messages from gcc, because it is a straightforward comparison with pgc++. Since I added *restrict to the arrays on the device, gcc flags every *restrict as an error. I am a bit confused.
Question_2
Why does this happen? It should be recognized by gcc too. Not so important, this question, just for information.

If the situation is not easy to understand from my explanation, let me know and I will try to make it clear.

Below are the run messages; not a big help for me right now, but maybe they are helpful for you.

$ ./a.out
p = 6 i = 0 p = 6 i = 1 p = 6 i = 2 
Accelerator Kernel Timing data
/home/a/Documents/3.cpp
  main  NVIDIA  devicenum=0
    time(us): 14,596
    1146: data region reached 1 time
        1146: data copyin transfers: 4
             device time(us): total=49 max=26 min=5 avg=12
    1192: data region reached 10 times
        1192: data copyin transfers: 135
             device time(us): total=484 max=9 min=3 avg=3
    1192: update directive reached 874 times
        1192: data copyin transfers: 874
             device time(us): total=3,556 max=21 min=3 avg=4
    1192: compute region reached 874 times
        1192: data copyin transfers: 874
             device time(us): total=2,651 max=13 min=3 avg=3
        1192: kernel launched 874 times
            grid: [3]  block: [32x4]
            elapsed time(us): total=1,781,626 max=2,397 min=1,841 avg=2,038
        1192: reduction kernel launched 873 times
            grid: [2]  block: [256]
            elapsed time(us): total=12,574 max=32 min=13 avg=14
        1192: data copyout transfers: 873
             device time(us): total=7,856 max=26 min=8 avg=8
Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Hi tasica,

Question_1
What should I write for “pox”, since for that vector loop I get “reduction type not supported”

Let’s review what you have posted for “pox”, since there are a number of issues here:

#pragma acc loop worker reduction(min:min0)
for (a = 0; a < 3; a++)
{
   int x = 0;
   int *restrict pox ...  // 1000 members

   #pragma acc loop vector reduction(+:x)
   for (i = 1; i < 365; i++)           //line 1214
      for (v = 1; v < 10; v++) {
         use CT, sir
         pox[x] = ...
         x++;
         free();                       //line 1275
      }

   #pragma acc loop vector reduction(min:min0)
   for (j = 1; j < 40; j++) {
      use pox[x]
      if (pox)                         //line 1286
         min0 = ...
   }

   free(pox);
}

First, let me explain how reductions work. By putting the “x++” in the loop you’ve created a dependency, since each iteration depends on what the value of x was in the previous iteration before it can be updated. With a reduction, however, the loop can be parallelized by having each thread make a private copy of x, do a partial reduction on its copy, and then, after the loop, perform a final reduction over the partial results.

But here you’re using the value of x as the index into pox. If you use a reduction, you’d be using the private x, not the shared one, and hence be updating the wrong element of pox.

If the order in which the values of pox are added doesn’t matter, you could try using an atomic capture, where the shared variable is updated atomically. Something like:

int xx;
#pragma acc atomic capture
xx = x++;
pox[xx] = …

Atomic operations can be slow, however, so you may lose some performance.
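
To sketch the idiom in the context of your earlier loops (illustrative only; note that x must be a variable that is actually shared across the vector lanes, e.g. declared at gang scope, not privatized):

int x = 0;   // shared slot counter, declared outside the vector loop
#pragma acc loop vector
for (i = 1; i < 365; i++)
   for (v = 1; v < 10; v++)
   {
      int xx;
      #pragma acc atomic capture
      xx = x++;        // atomically reserve the next free slot
      pox[xx] = ...;   // each thread writes only to its reserved slot
   }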

Question_1
and “shared is invalid text in pragma”?

I don’t see where you have a “shared” clause in your example, but “shared” isn’t an OpenACC clause. It is an OpenMP clause, so perhaps you’re mixing the two?

Until now I helped myself with error messages from gcc, because it is a straightforward comparison with pgc++. Since I added *restrict to the arrays on the device, gcc flags every *restrict as an error. I am a bit confused.
Question_2
Why does this happen? It should be recognized by gcc too.

“restrict” is part of C99. Until recently, gcc defaulted to C89. Try setting “-std=c99” to get gcc to recognize restrict.

From your run output, it looks like your code is getting an illegal address error and crashing:

call to cuMemFreeHost returned error 700: Illegal address during

Without the full code, I can’t know for sure where or why this occurs. You can try setting the environment variable “PGI_ACC_NOTIFY=1” to have the runtime show you all the kernel launches; you can then see the last one that was executed. There’s also “PGI_ACC_DEBUG=1”, which will give you very detailed information about all the OpenACC runtime calls being made.
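
For example (assuming a bash-like shell and the a.out from your run output above):

$ PGI_ACC_NOTIFY=1 ./a.out    # print a line for each kernel launch
$ PGI_ACC_DEBUG=1 ./a.out     # trace every OpenACC runtime call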

It does look like you are allocating memory from within device code. Keep in mind that there’s a very small heap (~8MB) on the device which can be consumed very quickly. If you run out of heap space, this can cause the program to crash with this error. I don’t know for sure if this is happening here, but you can try increasing the heap size via the environment variable “PGI_ACC_CUDA_HEAPSIZE=”.

Another possible issue: you note that pox has 1000 elements, but it looks like x gets updated on every iteration of the i and v loops, which is 364 × 9 = 3276 iterations. This would have you writing off the end of the pox array and thus cause an illegal address error.
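
A minimal fix, assuming pox really can receive one element per (i, v) iteration and holds int values as the sketch suggests, would be to size it for that worst case rather than a fixed 1000:

int *pox = (int *) malloc(sizeof(int) * 364 * 9);   // worst case: one slot per (i, v) pair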

Also, why is there an empty “free()” statement at line 1275?

Note that without a full reproducing example, much of my answers are just educated guesses. If you are still stuck, having you post the full code, or send a copy to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me would be helpful.

Though before doing this, be sure your program is correct on the host (i.e. when not using OpenACC). For example, if you are accessing pox outside its bounds, this error would also appear in the host version. It doesn’t always cause the program to crash (it depends on what other memory you’re writing into), so utilities such as Valgrind are very useful for debugging the code.

-Mat

Hi Mat,

Maybe my experience helps improve pgc++.
For various reasons you are my only source for learning OpenACC; after every discussion with you I’ve made good steps in learning “acc”.
A few remarks:

  • OpenACC is developing fast; many things I found via Google/YouTube (the traditional way of learning) were obsolete or deprecated (e.g. #pragma acc region …)
  • I thought “let’s follow the latest version of the PDF guides”, so I started to read the latest versions of the guides from openacc.org and pgroup.com. The results were better than the first way, but examples for directives and clauses are missing or poor.
  • there are also differences between what OpenACC 2.5/2.6 requires and what is or isn’t already implemented in pgc++. I also saw users in the forum asking you about some clauses, and your answer was “work in progress”; I am not alone in using the wrong things.
  • “let’s read the posts in the users forum”: the results were better than the second way, but still not good. The reason is that finding exactly what I’m interested in takes long, because sometimes the titles are not suggestive, or a 100-post thread drifts in a different direction from where it started.
  • the pgc++ compiler is not perfect at explaining errors, and an example can help to avoid errors.

So, I am missing an up-to-date overview, like a spreadsheet or web page where I can find:

directive            clause     introduced          deprecated     example C code source
-----------------------------------------------------------------------------------------
#pragma acc region              version 10.0        version 14.0   obsolete - no example
#pragma acc atomic   write      version 10.0                       link to example
                     capture    version 12.0                       same link to same example
                     link       work in progress                   no example yet

“Link to the example”: one example can cover a few clauses (as commented lines).
This would help me avoid obsolete directives and see an example of how to use a specific directive+clause to avoid errors.
The “Reference Guide” has a bit of a table but no examples, and it is a book.
A table is more structured than a book, and engineers prefer graphs/tables to reading pages.
If there is already a clear/simple way to learn acc and I just don’t know it, please excuse me for writing this story, and point me to it.
May I use email for comments on 2.cpp, or should I ask here?

Hi Mat,

I have tried to implement “#pragma acc atomic capture” following the suggestions from the forum.
I received “nvvmCompileProgram error: 9” and gave up.
Now I’ve installed all the new software (CUDA 9.1 with patches + driver version 390.30), because I saw some posts about nvvm mentioning “old drivers”.
But I get the same error compiling 2.cpp.
If it works on your computer, maybe my compiler is still missing something.

Q_1
I don’t know how to get past this error:
“Error: 2.cpp(292) Error: unsupported operation”
It points at the “xx = x++;”:

	int xx;
	#pragma acc atomic capture 
	xx = x++;

Below is the output from 2.cpp; the link shows the compiler configuration:
http://textuploader.com/dxfsz

main:
    191, Generating copyin(ls7[:(m1-n1)*(m2-n2)],vj4[:(p1-b1)*(p2-b2)],ve1[:(f1-l1)*(f2-l2)],ct8[:(e1-a1)*((e2-a2)*((e3-a3)*((e4-a4)*(e5-a5))))])
    205, Generating enter data create(sir[:22])
    224, Generating update device(sir[:22])
         Generating present(sir[:])
         Accelerator kernel generated
         Generating Tesla code
        224, Generating reduction(min:mX)
        227, #pragma acc loop gang collapse(2) /* blockIdx.x */
        230,   /* blockIdx.x collapsed */
        234, #pragma acc loop vector(128) collapse(2) /* threadIdx.x */
        235,   /* threadIdx.x collapsed */
        239, #pragma acc loop vector(128) /* threadIdx.x */
        243, #pragma acc loop seq
        248, #pragma acc loop seq
        249, #pragma acc loop seq
        250, #pragma acc loop seq
        256, #pragma acc loop seq
        261, #pragma acc loop seq
        268, #pragma acc loop seq
        272, #pragma acc loop seq
        273, #pragma acc loop seq
        280, #pragma acc loop seq
        282, #pragma acc loop seq
        287, #pragma acc loop seq
        288, #pragma acc loop seq
        301, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(min:mX)
    234, Loop is parallelizable
    235, Loop is parallelizable
    239, Loop is parallelizable
    243, Loop is parallelizable
    248, Loop carried reuse of mx-> prevents parallelization
    249, Loop carried dependence of mx-> prevents parallelization
         Loop carried backward dependence of mx-> prevents vectorization
    250, Loop is parallelizable
    256, Accelerator restriction: size of the GPU copy of tm,mx is unknown
         Loop is parallelizable
    261, Accelerator restriction: size of the GPU copy of tm is unknown
         Loop is parallelizable
    268, Accelerator restriction: size of the GPU copy of ti is unknown
         Loop is parallelizable
    272, Loop carried reuse of ti-> prevents parallelization
         Loop carried scalar dependence for k at line 276,275
    273, Accelerator restriction: size of the GPU copy of tm,ti is unknown
         Loop carried reuse of ti-> prevents parallelization
         Loop carried scalar dependence for k at line 276,275
    280, Accelerator restriction: size of the GPU copy of ix is unknown
         Loop is parallelizable
    282, Accelerator restriction: size of the GPU copy of ti,ix is unknown
         Loop is parallelizable
    287, Loop carried reuse of pox-> prevents parallelization
    288, Accelerator restriction: size of the GPU copy of pox is unknown
         Complex loop carried dependence of pox-> prevents parallelization
         Loop carried reuse of pox-> prevents parallelization
    301, Loop is parallelizable
    321, Generating exit data delete(sir[:1])
nvvmCompileProgram error: 9.
Error: 2.cpp(292) Error: unsupported operation
PGCC-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (2.cpp: 1)
PGCC/x86 Linux 17.10-0: compilation aborted

Hi tasica,

Q_1
I don’t know how to get past this error:
“Error: 2.cpp(292) Error: unsupported operation”
It points at the “xx = x++;”

I don’t see this error in the version of the code I sent you. Did you modify where “x” is declared or make other changes?

Only global variables or gang-private (i.e. CUDA shared) variables can be used as atomic variables. If “x” is a local variable, you can get this error.

-Mat

Hi Mat,

(As usual) I had a few combined mistakes which gave that message (wrong installation + compile options + error 999).
Now I have cleaned them up, and I would like you to have a look at the code.
Because there are 3 txt files of input data, may I use email to send the code?

Sure. Though I’m a bit swamped with some major projects, so I may not have much time to spend on it. I’ll do my best.

-Mat