how to carry out the sum operation in cuda fortran?

For a large size array,it’s fairly easy to realize the sum operation in cudaC via pointer, and I just wonder how to perform this operation efficiently in cuda fortran using GPU.
thanks!

Hi Bullish,

Can you please explain a bit more or post an example of what you mean by “sum operation”?

Performing sum reductions in parallel are quite difficult to perform efficiently, but no more so for Fortran then C. I wrote a basic one for an article I wrote (See: Account Login | PGI), but by no means is it optimal. NVIDIA has a good slide deck on reductions (See: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) that helps explain the details.

  • Mat

Hi Mat,
Firstly thank you for your reply. The sum operation I mentioned is exactly the intrinsic function sum() in Fortran. I tried to rewrite function sum() with CUDA Fortran, and the GPU code is much slower than CPU.According to my knowledge, CUDA fortran doesn’t support direct memory address operation, so the GPU capability is hard to be fulled exploited even with the partial sum trick. Have you encountered such problem?

Hi Bullish,

Using the sum intrinsic from within a device kernel would be very slow since each thread would be performing the sum and need to access the device’s global memory. I would advice against using the reduction intrinsics in a device kernel unless you are reducing a small local or shared array.

To efficiently perform reductions, you should follow the partial reduction examples described earlier. Note that sum reductions on a GPU are not expected to be faster then the CPU. Rather, they should only be used if the cost to transfer the data is greater than the cost of the reduction.

Note that as of the 10.5 release, the PGI accelerator model is able to use CUDA Fortran device data. This will allow you to utilize the PGI accelerator’s highly optimized reductions within CUDA Fortran. For example from the host add the follow and tehn compile with “-ta=nvidia”.

!$acc region
  sumVal = sum(devArr)
!$acc end region

As for your question about direct memory address (DMA) operations, again I’m not clear as to what you mean. DMA has to do with how data is transferred to and from the CPU and GPU. Do you mean pinned memory (which is supported in CUDA Fortran)?

  • Mat

Hi Mat,

I know this is an old post, and I was wondering if there are news solutions now to sum arrays on device ?
Assuming that A_h and A_d are a two dimension real arrays, one on host the other on device.

How do I get the equivalent of “sum(A_h(:,i))” using A_d ?

Thank you in advance,
Remy

Hi Remy,

We’ve since added host side intrinsics that accept device data including sum. So here the equivalent would simply be “sum(A_d(:,i))”. For example:

% cat test.cuf

program foo
  use cudafor
  real, device, dimension(:,:), allocatable :: Arr_d
  real :: sumval

  allocate(Arr_d(64,64))
  Arr_d = 1.0
  sumval = sum(Arr_d(:,1))
  print *, sumval

end program foo
% nvfortran test.cuf ; a.out
    64.00000

Hope this helps,
Mat

Following this question, can i directly use SUM in a device kernel? will it be less efficient?

For example, is this code problematic?

attributes(global) subroutine mass_kernel_4(N_TRACKED_SPECIES, N_TOTAL_SCALARS, &
IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, SOLID, RHOS, ZZS )

  implicit none

integer*4 :: I, J, K

integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS, N_TRACKED_SPECIES

LOGICAL, intent(in) :: SOLID(0:IBP1,0:JBP1,0:KBP1)
REAL(EB), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: RHOS
REAL(EB), intent(in), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS

    !blockIdx%x starts from 1, which is different from cuda c/c++ code
    !threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x 
    !Note that some arrys start from 0, but they are not main arrays

I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z

if(I .ge. 1 .and. I .le. IBAR &
        .and.  J .ge. 1 .and. J .le. JBAR &
        .and. K .ge. 1 .and. K .le. KBAR &
        .and. SOLID(I,J,K) .eq. .false. ) then
               RHOS(I,J,K) = SUM(ZZS(I,J,K,1:N_TRACKED_SPECIES))
endif

end subroutine mass_kernel_4

Thanks.

Hi Mat, in the bellow code, I can only get the correct answer when ZSS is initialized to zero:

attributes(global) subroutine mass_kernel_5(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, ZZS,tt, tt_part )

  implicit none

integer*4 :: I, J, K,N

integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS

REAL(8), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS

REAL(8), intent(inout), dimension(0:IBP1,0:JBP1,0:KBP1)::  tt_part
REAL(8), intent(inout) ::  tt

!when define tt as value, the value of tt cannot be passed into the kernel from the calling
!function/subroutine
!REAL(8), value ::  tt 

    !blockIdx%x starts from 1, which is different from cuda c/c++ code
    !threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x 
    !Note that some arrys start from 0, but they are not main arrays

I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z

if( I == 1 .and. J == 1 .and. K == 1 ) print *, "inside kernel, tt=", tt

!tt cannot be used for reduction, we have to define a 3d array tt_part
if( I .le. IBAR .and.  J .le. JBAR .and.  K .le. KBAR ) then
            DO N=1, N_TOTAL_SCALARS
               !ZZS(I,J,K,N) = ZZS(I,J,K,N) +  tt + N_TOTAL_SCALARS
               ZZS(I,J,K,N) = ZZS(I,J,K,N)

ENDDO
!tt=tt+1.0
!tt_part(I,J,K) = I + J + K
tt_part(I,J,K) = sum(ZZS(I,J,K,1:N_TOTAL_SCALARS))
endif

end subroutine mass_kernel_5

program mass_kw_5

!@cuf use cudafor

implicit none

integer, PARAMETER :: IBAR=36, JBAR=24, KBAR=24, IBP1=37, JBP1=25, KBP1=25, N_TOTAL_SCALARS=3

REAL(8),  DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS):: ZZS
REAL(8),  DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: tt_part

REAL(8), device, DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: D_ZZS
REAL(8),  device, DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: D_tt_part


!REAL(8), dimension(1) :: tt 
!REAL(8), device, dimension(1) :: D_tt

REAL(8)  :: tt
REAL(8) , device :: D_tt

integer :: I,J,K,istat

integer*4 ::  blockDim_x=4, blockDim_y=4, blockDim_z=8
integer*4 :: gridDim_x, gridDim_y, gridDim_z

type(dim3) :: grid, block

!print *, size(tt)

istat=cudaMemset(D_ZZS, 0._8, size(ZZS) )
istat=cudaMemset(D_tt_part, 0._8, size(tt_part) )

!ZZS=0._8
ZZS = -1._8
!ZZS = 1._8

ZZS(1,13,3,1)=0.99576610439851176
ZZS(1,13,3,2)=0.0
ZZS(1,13,3,3)=0.030113580765101414

tt = 0.5
D_tt = tt
D_tt_part = tt_part
D_ZZS = ZZS

gridDim_x = (IBAR+blockDim_x-1)/blockDim_x
gridDim_y = (JBAR+blockDim_y-1)/blockDim_y
gridDim_z = (KBAR+blockDim_z-1)/blockDim_z

grid = dim3(gridDim_x, gridDim_y,gridDim_z)
block= dim3(blockDim_x, blockDim_y,blockDim_z)

print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)

call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, D_tt, D_tt_part )
!call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, tt, D_tt_part )

istat = cudaDeviceSynchronize()

ZZS= D_ZZS
tt_part= D_tt_part

tt = sum(tt_part)

print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)
print *, tt, sum(ZZS)

compiled by nvfortran -cuda -g ./test.f90
run by: ./a.out
test# ./a.out
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
inside kernel, tt= 0.5000000000000000
0.5000000000000000
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
-62203.97412031516 -77059.97412031516

I was expecting tt and sum(ZZS) are identical.

Thank you !

Sorry Mat, I got where I was wrong: tt_part and ZZS were not initialized the same value.

Thank you.