reduction within "!$acc kernels loop" ?

JMa1 · January 11, 2013, 12:17am

Hi All,

Is reduction allowed within !$acc kernels loop ?
I tried to compile the following small code but got many errors.
I know it works if I replace “kernels” with “parallel” in the $acc line, but as I found in my previous post, “$acc kernels” performs much faster than “$acc parallel”.

So it would be nice if it could work with “$acc kernels loop”.

CODE:
tmp=0.d0
call system_clock(count1, count_rate, count_max)
!$acc kernels loop reduction(+:tmp)
do i=1, n_size
  do j=1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
      tmp=tmp+1.d0
    enddo
  enddo
enddo

print*, 'iteration#:',tmp

call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'micronseconds'

Thanks,
JMa

MatColgrove · January 11, 2013, 4:56pm

Is reduction allowed within !$acc kernels loop ?

Yes. Though, I’ve complained to our compiler engineers since we don’t print a feedback message when a reduction clause is used. We do when the compiler automatically generates the reduction, but just not when it’s made explicit. They’ll get it fixed.
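
As a minimal sketch of an explicit reduction clause on a combined kernels loop directive in the simplest, single-level case: the program name, `n`, and `total` below are hypothetical and not taken from the thread, and the sketch does not show how any particular compiler release handles the clause. Without an OpenACC flag the directive is treated as a comment and the loop runs on the CPU.

```fortran
! Sketch only: explicit_reduction, n, and total are illustrative names.
program explicit_reduction
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  double precision :: total

  total = 0.d0
  ! Explicit sum reduction on a single loop level
  !$acc kernels loop reduction(+:total)
  do i = 1, n
     total = total + 1.d0
  enddo

  ! Each iteration adds exactly 1, so the sum must equal n
  if (total /= dble(n)) error stop 'unexpected sum'
  print '(a)', 'PASS'
end program explicit_reduction
```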

Here’s the output after I remove the “reduction” clause and use just “!$acc kernels loop”. Note that the sum reduction is generated for tmp.

% pgf90 test.f90 -acc -ta=nvidia,4.2,keepgpu -Minfo=accel
testsub:
      7, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
     10, Loop is parallelizable
     11, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
          9, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         10, !$acc loop gang ! blockidx%y
         11, CC 1.3 : 16 registers; 64 shared, 32 constant, 0 local memory bytes
             CC 2.0 : 23 registers; 0 shared, 76 constant, 0 local memory bytes
         13, Sum reduction generated for tmp

Mat

JMa1 · January 11, 2013, 5:15pm

Hi Mat,
Nice to see you online and very happy to see your reply.

I tried recompiling after removing “reduction(+:tmp)”; however, parallelization failed, and the compiler reported:
     30, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
     31, Loop carried scalar dependence for 'tmp' at line 35
         Scalar last value needed after loop for 'tmp' at line 40
         Accelerator restriction: scalar variable live-out from loop: tmp
         Accelerator scalar kernel generated
     32, Loop carried scalar dependence for 'tmp' at line 35
         Scalar last value needed after loop for 'tmp' at line 40
         Accelerator restriction: scalar variable live-out from loop: tmp
     33, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence due to exposed use of 'c(i1+1,i2+1)' prevents parallelization
         Loop carried scalar dependence for 'tmp' at line 35
         Scalar last value needed after loop for 'tmp' at line 40
         Accelerator restriction: scalar variable live-out from loop: tmp

In addition, when I have reduction explicitly stated with kernels, I got many errors:

------ Rebuild All started: Project: 2ndOpenACC, Configuration: Debug x64 ------
Deleting intermediate and output files for project ‘2ndOpenACC’, configuration ‘Debug’
Compiling Project …
…\2ndOpenACCProgram.cuf
C:\Users.…\2ndOpenACCProgram.cuf(50) : warning W0093 : Type conversion of expression performed
C:\Users.…\2ndOpenACCProgram.cuf(50) : warning W0093 : Type conversion of expression performed
0 inform, 2 warnings, 0 severes, 0 fatal for example1
C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(100): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(100): error: expected a “)”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(98): error: attribute “global” does not apply here

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(98): error: attribute “launch_bounds” does not apply here

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(105): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(128): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(129): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(130): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(130): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(131): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(131): error: identifier “S108” is undefined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(132): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(132): error: a value of type “float *” cannot be used to initialize an entity of type “int”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(133): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(133): error: a value of type “float *” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(133): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(134): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(135): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(136): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(137): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(138): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(138): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(139): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(140): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(140): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(141): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(141): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(142): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(142): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(143): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(143): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(144): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(144): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(145): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(145): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(146): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(147): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(147): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(148): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(149): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(149): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(150): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(150): error: variable “rc4” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(151): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(152): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(152): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(152): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(153): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(153): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(154): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(154): error: variable “rc5” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(155): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(156): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(156): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(157): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(157): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(158): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(159): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(159): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(159): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(160): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(160): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(161): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(162): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(162): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(162): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(163): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(163): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(164): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(165): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(165): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(165): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(166): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(166): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(167): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(168): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(168): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(168): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(169): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(169): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(170): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(171): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(171): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(171): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(172): error: explicit type is missing (“int” assumed)

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(172): error: cannot overload functions distinguished by return type alone

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(173): error: expected a declaration

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(174): error: expected an identifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(174): error: a value of type “int” cannot be used to initialize an entity of type “int *”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(174): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(175): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(175): error: variable “b1” has already been defined

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(175): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(176): error: this declaration has no storage class or type specifier

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(176): error: expected a “;”

C:\Users.…\pgcudafor2a4C6bORi-hOB3.gpu(177): error: expected a declaration

96 errors detected in the compilation of “C:\Users.…\pgnvd2a2quIZC_nhXn.nv0”.
2ndOpenACC build failed.

========== Rebuild All: 0 succeeded, 1 failed, 0 skipped ==========

It seems I got different compilation results from yours. Do you know why this happens?

Thanks and have a nice day,

Jingsen

MatColgrove · January 11, 2013, 5:23pm

Hi Jingsen,

Can you post your full example to ensure we’re compiling the same thing? Also, which compiler version are you using? The second error with the explicit use of the reduction clause looks like a bug where we’re generating bad CUDA code. Once I have your example, I’ll investigate it.

Mat

JMa1 · January 11, 2013, 5:37pm

Hi Mat,
The version was downloaded on Dec. 20, 2012, with Intel VF shell 2010: “pgivfx64-vs2010-1210.exe”.

Following is the sample program.

Thanks,
Jingsen

! matrix-acc.f
program example1

parameter ( n_size=2000 )
real*8, dimension(:,:) :: a(n_size,n_size)
real*8, dimension(:,:) :: b(n_size,n_size)
real*8, dimension(:,:) :: c(n_size,n_size)
real*8, dimension(:,:) :: d(n_size,n_size)
character(10) :: time
real tmp
integer count1, count2, count_rate, count_max

! Initialize matrices (values differ from C version)
do i=1, n_size
  do j=1, n_size
    a(i,j) = i + j;
    b(i,j) = i - j;
  enddo
enddo
c=0.d0
d=0.d0

tmp=0.d0
call system_clock(count1, count_rate, count_max)
!$acc kernels loop !reduction(+:tmp)
do i=1, n_size
  do j=1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
      tmp=tmp+1.d0
    enddo
  enddo
enddo

print*, 'iternation#:',tmp

call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'micronseconds'

tmp=0.d0
call system_clock(count1, count_rate, count_max)
do i=1, n_size
  do j=1, n_size
    do k = 1, n_size
      d(i,j) = d(i,j) + a(i,k)*b(k,j)
      tmp=tmp+1.d0
    enddo
  enddo
enddo
call system_clock(count2, count_rate, count_max)
write(*,*)'CPU costs',(count2-count1),'micronseconds'

! check the results
do i=1, n_size
  do j=1, n_size
    if( c(i,j) .ne. d(i,j) )then
      print *, i,j, c(i,j), d(i,j)
      stop 'error found'
    endif
  enddo
enddo
print *, n_size*n_size, 'iterations completed'

end program example1

JMa1 · January 11, 2013, 6:00pm

Hi Mat,
I found the solution: simply change line 11
from
real tmp

to

real*8 tmp

Now the compiler reports:
     27, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
     28, Loop is parallelizable
     29, Loop is parallelizable
     30, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         28, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         29, !$acc loop gang ! blockidx%y
         32, Sum reduction generated for tmp

It is very interesting that this tiny change makes such a huge difference. However, I’m still confused and curious about why that is.
Do you have any thoughts?

Thanks,
Jingsen

MatColgrove · January 11, 2013, 6:10pm

Hi Jingsen,

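The first error (i.e. without the reduction clause) is caused by a data type mismatch: tmp is declared as a default (single-precision) REAL but is incremented by the double-precision constant “1.d0”. Please declare tmp as DOUBLE PRECISION, or change “1.d0” to “1.0”.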
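
A minimal sketch of the two consistent alternatives, assuming a single loop level for brevity. The names `tmp_sp`, `tmp_dp`, and `n` are hypothetical and not from the posted program; the point is only that each accumulator's kind matches the constant added to it, so no type conversion appears in the update expression.

```fortran
! Sketch only: kind_consistency, tmp_sp, tmp_dp, and n are illustrative names.
program kind_consistency
  implicit none
  integer, parameter :: n = 100
  integer :: i
  real :: tmp_sp              ! default real, paired with the constant 1.0
  double precision :: tmp_dp  ! double precision, paired with the constant 1.d0

  tmp_sp = 0.0
  tmp_dp = 0.d0
  !$acc kernels loop
  do i = 1, n
     tmp_sp = tmp_sp + 1.0    ! same kind on both sides
     tmp_dp = tmp_dp + 1.d0   ! same kind on both sides
  enddo

  ! Both counters should equal the trip count
  if (tmp_sp /= real(n) .or. tmp_dp /= dble(n)) error stop 'unexpected sum'
  print '(a)', 'PASS'
end program kind_consistency
```

Compiled without an OpenACC flag, the directive is ignored and the loop runs serially, which is a quick way to check the expected values before offloading.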

As for the second error, the problem is that the “k” loop is executed sequentially, so each kernel needs to sum up multiple “tmp” values. We handle this correctly when we auto-generate the reduction, but obviously not when the reduction clause is used. I’ll write up a report.

Mat

JMa1 · January 11, 2013, 6:30pm

Hi Mat,
For the first error, I also tried declaring
integer tmp
and, consistently, using tmp=tmp+1.
Now I run into another issue: I get the wrong value for “tmp”:

iternation#: 2001
GPU costs 1045000 micronseconds

The compiler output:

     27, Generating present_or_copy(c(:,:))
         Generating present_or_copyin(a(:,:))
         Generating present_or_copyin(b(:,:))
     28, Loop is parallelizable
     29, Loop is parallelizable
     30, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         28, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         29, !$acc loop gang ! blockidx%y
         32, Accelerator restriction: multilevel induction variable: tmp
             Accelerator restriction: induction variable live-out from loop: tmp

The program now looks like:

! matrix-acc.f
program example1

parameter ( n_size=2000 )
real*8, dimension(:,:) :: a(n_size,n_size)
real*8, dimension(:,:) :: b(n_size,n_size)
real*8, dimension(:,:) :: c(n_size,n_size)
real*8, dimension(:,:) :: d(n_size,n_size)
character(10) :: time
integer tmp
integer count1, count2, count_rate, count_max

! Initialize matrices (values differ from C version)
do i=1, n_size
  do j=1, n_size
    a(i,j) = i + j;
    b(i,j) = i - j;
  enddo
enddo
c=0.d0
d=0.d0

tmp=1
call system_clock(count1, count_rate, count_max)
!$acc kernels loop !reduction(+:tmp)
do i=1, n_size
  do j=1, n_size
    do k = 1, n_size
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
      tmp=tmp+1
    enddo
  enddo
enddo

print*, 'iternation#:',tmp

call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'micronseconds'

tmp=0
call system_clock(count1, count_rate, count_max)
do i=1, n_size
  do j=1, n_size
    do k = 1, n_size
      d(i,j) = d(i,j) + a(i,k)*b(k,j)
      tmp=tmp+1
    enddo
  enddo
enddo
call system_clock(count2, count_rate, count_max)
write(*,*)'CPU costs',(count2-count1),'micronseconds'

! check the results
do i=1, n_size
  do j=1, n_size
    if( c(i,j) .ne. d(i,j) )then
      print *, i,j, c(i,j), d(i,j)
      stop 'error found'
    endif
  enddo
enddo
print *, n_size*n_size, 'iterations completed'

end program example1

MatColgrove · January 11, 2013, 6:53pm

OK, so it looks like we’re not really handling multi-level reductions well, and we’re just getting lucky when the variable is a real. The next step would be to add multiple summation variables:

tmp=0
call system_clock(count1, count_rate, count_max)
!$acc kernels loop private(tmp2)
do i=1, n_size
  do j=1, n_size
    tmp2 = 0
    do k = 1, n_size
      tmp2 = tmp2+1
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
    tmp=tmp+tmp2
  enddo
enddo

I’ll add this to my report.

Thanks,
Mat
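
For readers who want to try the two-counter workaround in isolation, here is a self-contained sketch. It assumes a much smaller matrix size than the thread's 2000 so that it runs quickly, and the program name, `n`, and the final check are additions for illustration. The check only confirms the expected count; it does not show how any specific compiler release treats the directive.

```fortran
! Sketch only: multilevel_count and n are illustrative; tmp and tmp2 follow the thread.
program multilevel_count
  implicit none
  integer, parameter :: n = 20           ! assumed small size for a quick run
  double precision :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k, tmp, tmp2

  ! Same initialization pattern as the posted program
  do i = 1, n
     do j = 1, n
        a(i,j) = i + j
        b(i,j) = i - j
     enddo
  enddo
  c = 0.d0

  tmp = 0
  !$acc kernels loop private(tmp2)
  do i = 1, n
     do j = 1, n
        tmp2 = 0                         ! partial count for this (i,j) pair
        do k = 1, n
           tmp2 = tmp2 + 1
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        enddo
        tmp = tmp + tmp2                 ! sum of partial counts over the i and j loops
     enddo
  enddo

  ! The innermost statement runs n*n*n times in total
  if (tmp /= n*n*n) error stop 'count mismatch'
  print '(a)', 'PASS'
end program multilevel_count
```

Keeping the inner counter private and adding it to `tmp` once per (i,j) pair moves the global sum out of the sequential k loop, which is the situation described above as problematic.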
