How to correctly initialize values for reductions

Hi,

I have a question regarding the initialization of values for reductions. What is the easiest way to initialize them? Is my understanding correct that in the program below the variable rtmp is set to zero on the host side, and for this reason it must be explicitly copied in to the device again? Is there any way to do this without a second copyin directive?

Thank you for your help.

             program testdaxpy
                use cudafor
                use accel_lib
                implicit none
                integer, parameter :: N = 1000
                integer ierr, npe0, iam0, numdevices,i
                real*8 :: x(N), y(N), a, rtmp
                real, device :: x_d(N), y_d(N)
                type(dim3) :: grid, tBlock

                numdevices = acc_get_num_devices(acc_device_nvidia)
                if(numdevices.ne.0)then
                 call acc_set_device_num(mod(iam0,numdevices),acc_device_nvidia)
                endif

                x = 1.0d0
                y = 2.0d0
                ! main loop
                !$acc data copyin(rtmp,x,y)
                do i=1,N
                  ! set the value to zero (on the CPU or device?)
                  rtmp = 0.0d0

                  !$acc data copyin(rtmp)
                  call testKernel(N,rtmp,x,y)
                  !$acc end data

                enddo
                !$acc end data

              end program testdaxpy

              subroutine testKernel(N,rtmp,x,y)
                implicit none
                integer N, i
                real*8 x(N), y(N), rtmp

                !$acc kernels present(x,y,rtmp)
                do i=1,N
                   rtmp = rtmp + x(i) * y(i)
                enddo
                !$acc end kernels
              end subroutine testKernel

Hi Peter,

Per the OpenACC specification (see section 2.5.13, https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf), reduction variables have an initial value (e.g. 0 for a sum, 1 for a product), and the reduction result is then added/multiplied/etc. into the original variable after the end of the reduction.

Since you have the sum reduction variable in a data region, the result is added to the device copy of the variable, not the host copy. And because rtmp is copied to the device before it’s initialized to zero (the second data region is effectively ignored since rtmp is already present on the device), the device copy of rtmp is uninitialized and is never reset to zero between iterations of the “i” loop.

Try something like the following:

% cat red.f90
             program testdaxpy
                use openacc
                implicit none
                integer, parameter :: N = 1000
                integer ierr, npe0, iam0, numdevices,i
                real*8 :: x(N), y(N), a, rtmp

                iam0=0
                numdevices = acc_get_num_devices(acc_device_nvidia)
                if(numdevices.ne.0)then
                 call acc_set_device_num(mod(iam0,numdevices),acc_device_nvidia)
                endif

                x = 1.0d0
                y = 2.0d0
                ! main loop
                !$acc data copyin(rtmp,x,y)
                do i=1,10
                  ! set the value to zero (on the CPU or device?)
                  rtmp = 0.0d0
                  !$acc update device(rtmp)
                  call testKernel(N,rtmp,x,y)
                  !$acc update host(rtmp)
                  print *, "i=",i," rtmp=",rtmp
                enddo
                !$acc end data

              end program testdaxpy

              subroutine testKernel(N,rtmp,x,y)
                implicit none
                integer N, i
                real*8 x(N), y(N), rtmp

                !$acc kernels present(x,y)
                !$acc loop reduction(+:rtmp)
                do i=1,N
                   rtmp = rtmp + x(i) * y(i)
                enddo
                !$acc end kernels
              end subroutine testKernel
% pgfortran -ta=tesla red.f90 -Minfo=accel; a.out
testdaxpy:
     17, Generating copyin(rtmp,y(:),x(:)) [if not already present]
     21, Generating update device(rtmp)
     23, Generating update self(rtmp)
testkernel:
     35, Generating present(y(:),x(:))
         Generating implicit copy(rtmp) [if not already present]
     37, Loop is parallelizable
         Generating Tesla code
         37, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             Generating reduction(+:rtmp)
 i=            1  rtmp=    2000.000000000000
 i=            2  rtmp=    2000.000000000000
 i=            3  rtmp=    2000.000000000000
 i=            4  rtmp=    2000.000000000000
 i=            5  rtmp=    2000.000000000000
 i=            6  rtmp=    2000.000000000000
 i=            7  rtmp=    2000.000000000000
 i=            8  rtmp=    2000.000000000000
 i=            9  rtmp=    2000.000000000000
 i=           10  rtmp=    2000.000000000000

Note that manually managing “rtmp” is optional. If not in a data region, the result of the reduction would implicitly be added to the host variable.
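For reference, a minimal sketch of what that simplified version could look like (same declarations and testKernel as above, just with rtmp dropped from the data clause and no update directives; the loop bound of 10 matches the example above):

                ! Sketch only: rtmp is not in the data region, so the compiler
                ! generates an implicit copy of it for the reduction and the
                ! result is added back into the host variable after each call.
                !$acc data copyin(x,y)
                do i=1,10
                  rtmp = 0.0d0          ! host-side initialization is enough here
                  call testKernel(N,rtmp,x,y)
                  print *, "i=",i," rtmp=",rtmp
                enddo
                !$acc end data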

Also, I removed the CUDA Fortran portions of the code since they were unused and changed “use accel_lib” to “use openacc”. “accel_lib” is fine to use, but it contains PGI OpenACC API extensions which you’re not using, so “use openacc” is the better option for portability.

Hope this helps,
Mat

Thank you! It worked!

Is it better to avoid the manual management of single value variables? If I remove rtmp from the data clause the compiler will automatically generate a copyin wherever it is used and automatically update the variable on the host side?

If I remove rtmp from the data clause the compiler will automatically generate a copyin wherever it is used and automatically update the variable on the host side?

Correct. The result from the reduction will be implicitly copied out and added to the host variable. You do not need to manually manage reduction variables.

Is it better to avoid the manual management of single value variables?

It depends on the context, but in most cases it doesn’t matter.

The copyout of the reduction variable does cause the host to block while waiting for the result. So if you want to use “async” on the compute region, the only way to avoid that block is to manually manage the data movement of the reduction variable.

Also, if you only use the result of the reduction on the device, you can save the extra copy back to the host by managing it manually. Though in most cases the overhead of copying a single scalar is small, so it’s not worth the extra programming effort. But if you were performing the reduction many, many times, it may be.
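As a rough sketch of the async case (the queue number 1 and the surrounding structure are just illustrative, not from the original code): keeping rtmp present on the device lets the compute region run asynchronously, and the host only blocks at the explicit wait when it actually needs the value.

                rtmp = 0.0d0
                !$acc data copyin(x,y,rtmp)
                !$acc kernels present(x,y,rtmp) async(1)
                !$acc loop reduction(+:rtmp)
                do i=1,N
                   rtmp = rtmp + x(i) * y(i)
                enddo
                !$acc end kernels
                ! ... other host work can overlap with the device reduction ...
                !$acc update self(rtmp) async(1)
                !$acc wait(1)     ! block only when the host needs the result
                print *, "rtmp=",rtmp
                !$acc end data

If the result is only consumed by later device kernels, the update self and wait can be dropped entirely, which avoids the copy back to the host.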

-Mat

Thank you for the clarification!

If I want to avoid the data movement between the host and the device, can I use the kernels directive to set the value to zero on the device, like this:

!$acc data copyin(rtmp,x,y)
do i=1,10
   !$acc kernels(present)
      rtmp = 0.0d0
   !$acc end kernels
enddo

If I do this in our real code, the program does not produce the correct result, and the following message is generated when compiling:

Scalar last value needed after loop for ****** at line 862,917
Accelerator serial kernel generated
Generating Tesla code

Thank you for your help!

The syntax should be:

!$acc kernels default(present)



Scalar last value needed after loop for ****** at line 862,917
Accelerator serial kernel generated

I would need to see a fuller example to give specifics but I’ll do my best.

If the kernel is assigning a shared scalar, parallelizing the code would cause a race condition, so the compiler falls back to generating a sequential kernel. Plus, even if it were parallelized, you don’t have a loop clause here, so rtmp would be assigned in gang-redundant mode (i.e. all gangs would perform the same assignment redundantly).

Though the “Scalar last value” message implies that you’re using the result of the scalar on the right-hand side after the kernels region, or that the scalar is a module variable. In both cases, the compiler can’t implicitly privatize the variable, since the assignment to a shared scalar would cause a race condition.

The solutions here are to privatize the variable if it’s only being used as a temporary, put it in a reduction clause if it’s a reduction, or use atomics to update it. Note that atomics don’t guarantee order, only that each update will be visible to all threads, so assigning to a shared scalar in parallel can give different results depending on the order in which the threads run.
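To make those options concrete, here is a rough sketch of each (the variable names are placeholders, not from the real code):

                ! Option 1: privatize the scalar if it is only a per-iteration temporary
                !$acc parallel loop present(x,y) private(tmp)
                do i=1,N
                   tmp = x(i) * y(i)
                   y(i) = tmp + 1.0d0
                enddo

                ! Option 2: use a reduction clause if the scalar accumulates a result
                !$acc parallel loop present(x,y) reduction(+:rsum)
                do i=1,N
                   rsum = rsum + x(i) * y(i)
                enddo

                ! Option 3: atomically update a shared scalar (update order is not
                ! guaranteed, only that no update is lost)
                !$acc parallel loop present(x,y) copy(rsum2)
                do i=1,N
                   !$acc atomic update
                   rsum2 = rsum2 + x(i) * y(i)
                enddo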

-Mat

Thank you for the explanation! Yes, the value is used on the right-hand side after the kernels region.