how to carry out the sum operation in cuda fortran?

Bullish · May 10, 2010, 9:37am

For a large size array,it’s fairly easy to realize the sum operation in cudaC via pointer, and I just wonder how to perform this operation efficiently in cuda fortran using GPU.
thanks!

MatColgrove · May 10, 2010, 11:29pm

Hi Bullish,

Can you please explain a bit more or post an example of what you mean by “sum operation”?

Performing sum reductions in parallel are quite difficult to perform efficiently, but no more so for Fortran then C. I wrote a basic one for an article I wrote (See: Account Login | PGI), but by no means is it optimal. NVIDIA has a good slide deck on reductions (See: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) that helps explain the details.

Mat

Bullish · May 11, 2010, 7:31am

Hi Mat,
Firstly thank you for your reply. The sum operation I mentioned is exactly the intrinsic function sum() in Fortran. I tried to rewrite function sum() with CUDA Fortran, and the GPU code is much slower than CPU.According to my knowledge, CUDA fortran doesn’t support direct memory address operation, so the GPU capability is hard to be fulled exploited even with the partial sum trick. Have you encountered such problem?

MatColgrove · May 11, 2010, 4:22pm

Hi Bullish,

Using the sum intrinsic from within a device kernel would be very slow since each thread would be performing the sum and need to access the device’s global memory. I would advice against using the reduction intrinsics in a device kernel unless you are reducing a small local or shared array.

To efficiently perform reductions, you should follow the partial reduction examples described earlier. Note that sum reductions on a GPU are not expected to be faster then the CPU. Rather, they should only be used if the cost to transfer the data is greater than the cost of the reduction.

Note that as of the 10.5 release, the PGI accelerator model is able to use CUDA Fortran device data. This will allow you to utilize the PGI accelerator’s highly optimized reductions within CUDA Fortran. For example from the host add the follow and tehn compile with “-ta=nvidia”.

!$acc region
  sumVal = sum(devArr)
!$acc end region

As for your question about direct memory address (DMA) operations, again I’m not clear as to what you mean. DMA has to do with how data is transferred to and from the CPU and GPU. Do you mean pinned memory (which is supported in CUDA Fortran)?

Mat

bretin.remy · April 17, 2023, 7:13pm

Hi Mat,

I know this is an old post, and I was wondering if there are news solutions now to sum arrays on device ?
Assuming that A_h and A_d are a two dimension real arrays, one on host the other on device.

How do I get the equivalent of “sum(A_h(:,i))” using A_d ?

Thank you in advance,
Remy

MatColgrove · April 17, 2023, 7:58pm

Hi Remy,

We’ve since added host side intrinsics that accept device data including sum. So here the equivalent would simply be “sum(A_d(:,i))”. For example:

% cat test.cuf

program foo
  use cudafor
  real, device, dimension(:,:), allocatable :: Arr_d
  real :: sumval

  allocate(Arr_d(64,64))
  Arr_d = 1.0
  sumval = sum(Arr_d(:,1))
  print *, sumval

end program foo
% nvfortran test.cuf ; a.out
    64.00000

Hope this helps,
Mat

honggangwang1979 · January 22, 2024, 2:50am

Following this question, can i directly use SUM in a device kernel? will it be less efficient?

For example, is this code problematic?

attributes(global) subroutine mass_kernel_4(N_TRACKED_SPECIES, N_TOTAL_SCALARS, &
IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, SOLID, RHOS, ZZS )

  implicit none

integer*4 :: I, J, K

integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS, N_TRACKED_SPECIES

LOGICAL, intent(in) :: SOLID(0:IBP1,0:JBP1,0:KBP1)
REAL(EB), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: RHOS
REAL(EB), intent(in), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS

    !blockIdx%x starts from 1, which is different from cuda c/c++ code
    !threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x 
    !Note that some arrys start from 0, but they are not main arrays

I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z

if(I .ge. 1 .and. I .le. IBAR &
        .and.  J .ge. 1 .and. J .le. JBAR &
        .and. K .ge. 1 .and. K .le. KBAR &
        .and. SOLID(I,J,K) .eq. .false. ) then
               RHOS(I,J,K) = SUM(ZZS(I,J,K,1:N_TRACKED_SPECIES))
endif

end subroutine mass_kernel_4

Thanks.

honggangwang1979 · January 22, 2024, 3:07am

Hi Mat, in the bellow code, I can only get the correct answer when ZSS is initialized to zero:

attributes(global) subroutine mass_kernel_5(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, ZZS,tt, tt_part )

  implicit none

integer*4 :: I, J, K,N

integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS

REAL(8), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS

REAL(8), intent(inout), dimension(0:IBP1,0:JBP1,0:KBP1)::  tt_part
REAL(8), intent(inout) ::  tt

!when define tt as value, the value of tt cannot be passed into the kernel from the calling
!function/subroutine
!REAL(8), value ::  tt 

    !blockIdx%x starts from 1, which is different from cuda c/c++ code
    !threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x 
    !Note that some arrys start from 0, but they are not main arrays

I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z

if( I == 1 .and. J == 1 .and. K == 1 ) print *, "inside kernel, tt=", tt

!tt cannot be used for reduction, we have to define a 3d array tt_part
if( I .le. IBAR .and.  J .le. JBAR .and.  K .le. KBAR ) then
            DO N=1, N_TOTAL_SCALARS
               !ZZS(I,J,K,N) = ZZS(I,J,K,N) +  tt + N_TOTAL_SCALARS
               ZZS(I,J,K,N) = ZZS(I,J,K,N)

ENDDO
!tt=tt+1.0
!tt_part(I,J,K) = I + J + K
tt_part(I,J,K) = sum(ZZS(I,J,K,1:N_TOTAL_SCALARS))
endif

end subroutine mass_kernel_5

program mass_kw_5

!@cuf use cudafor

implicit none

integer, PARAMETER :: IBAR=36, JBAR=24, KBAR=24, IBP1=37, JBP1=25, KBP1=25, N_TOTAL_SCALARS=3

REAL(8),  DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS):: ZZS
REAL(8),  DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: tt_part

REAL(8), device, DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: D_ZZS
REAL(8),  device, DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: D_tt_part


!REAL(8), dimension(1) :: tt 
!REAL(8), device, dimension(1) :: D_tt

REAL(8)  :: tt
REAL(8) , device :: D_tt

integer :: I,J,K,istat

integer*4 ::  blockDim_x=4, blockDim_y=4, blockDim_z=8
integer*4 :: gridDim_x, gridDim_y, gridDim_z

type(dim3) :: grid, block

!print *, size(tt)

istat=cudaMemset(D_ZZS, 0._8, size(ZZS) )
istat=cudaMemset(D_tt_part, 0._8, size(tt_part) )

!ZZS=0._8
ZZS = -1._8
!ZZS = 1._8

ZZS(1,13,3,1)=0.99576610439851176
ZZS(1,13,3,2)=0.0
ZZS(1,13,3,3)=0.030113580765101414

tt = 0.5
D_tt = tt
D_tt_part = tt_part
D_ZZS = ZZS

gridDim_x = (IBAR+blockDim_x-1)/blockDim_x
gridDim_y = (JBAR+blockDim_y-1)/blockDim_y
gridDim_z = (KBAR+blockDim_z-1)/blockDim_z

grid = dim3(gridDim_x, gridDim_y,gridDim_z)
block= dim3(blockDim_x, blockDim_y,blockDim_z)

print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)

call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, D_tt, D_tt_part )
!call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, tt, D_tt_part )

istat = cudaDeviceSynchronize()

ZZS= D_ZZS
tt_part= D_tt_part

tt = sum(tt_part)

print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)
print *, tt, sum(ZZS)

compiled by nvfortran -cuda -g ./test.f90
run by: ./a.out
test# ./a.out
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
inside kernel, tt= 0.5000000000000000
0.5000000000000000
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
-62203.97412031516 -77059.97412031516

I was expecting tt and sum(ZZS) are identical.

Thank you !

honggangwang1979 · January 22, 2024, 3:24am

Sorry Mat, I got where I was wrong: tt_part and ZZS were not initialized the same value.

Thank you.

Topic		Replies	Views
CUDA FORTRAN shared memory warp-level sum reduction Legacy PGI Compilers	1	3450	May 19, 2014
sequential sum within a kernel. CUDA Programming and Performance	23	5336	September 8, 2008
sum reduction Legacy PGI Compilers	3	3327	August 29, 2017
Sum reduction working in Fermi, Kepler and Maxwell CUDA Programming and Performance	10	3787	February 1, 2016
help me with my first CUDA Fortran program. Legacy PGI Compilers	4	8541	March 31, 2010
convert Matlab array multiplication and sum function to CUDA equivalent CUDA Programming and Performance	12	13315	August 17, 2010
matrix reduction using cuda fortran and GPU Legacy PGI Compilers	33	13917	December 21, 2012
CUDA - calculation of a sum CUDA Programming and Performance	7	5752	April 30, 2010
How to correctly apply reduction of array elements on GPU without the need to copy between host and device? nvc, nvc++ and nvfortran cuda	5	1011	March 3, 2023
Understanding and adjusting Mark Harris's array reduction CUDA Programming and Performance	11	4789	August 26, 2018

how to carry out the sum operation in cuda fortran?

Related topics