For a large size array,it’s fairly easy to realize the sum operation in cudaC via pointer, and I just wonder how to perform this operation efficiently in cuda fortran using GPU.
thanks!
Hi Bullish,
Can you please explain a bit more or post an example of what you mean by “sum operation”?
Performing sum reductions in parallel are quite difficult to perform efficiently, but no more so for Fortran then C. I wrote a basic one for an article I wrote (See: Account Login | PGI), but by no means is it optimal. NVIDIA has a good slide deck on reductions (See: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) that helps explain the details.
- Mat
Hi Mat,
Firstly thank you for your reply. The sum operation I mentioned is exactly the intrinsic function sum() in Fortran. I tried to rewrite function sum() with CUDA Fortran, and the GPU code is much slower than CPU.According to my knowledge, CUDA fortran doesn’t support direct memory address operation, so the GPU capability is hard to be fulled exploited even with the partial sum trick. Have you encountered such problem?
Hi Bullish,
Using the sum intrinsic from within a device kernel would be very slow since each thread would be performing the sum and need to access the device’s global memory. I would advice against using the reduction intrinsics in a device kernel unless you are reducing a small local or shared array.
To efficiently perform reductions, you should follow the partial reduction examples described earlier. Note that sum reductions on a GPU are not expected to be faster then the CPU. Rather, they should only be used if the cost to transfer the data is greater than the cost of the reduction.
Note that as of the 10.5 release, the PGI accelerator model is able to use CUDA Fortran device data. This will allow you to utilize the PGI accelerator’s highly optimized reductions within CUDA Fortran. For example from the host add the follow and tehn compile with “-ta=nvidia”.
!$acc region
sumVal = sum(devArr)
!$acc end region
As for your question about direct memory address (DMA) operations, again I’m not clear as to what you mean. DMA has to do with how data is transferred to and from the CPU and GPU. Do you mean pinned memory (which is supported in CUDA Fortran)?
- Mat
Hi Mat,
I know this is an old post, and I was wondering if there are news solutions now to sum arrays on device ?
Assuming that A_h and A_d are a two dimension real arrays, one on host the other on device.
How do I get the equivalent of “sum(A_h(:,i))” using A_d ?
Thank you in advance,
Remy
Hi Remy,
We’ve since added host side intrinsics that accept device data including sum. So here the equivalent would simply be “sum(A_d(:,i))”. For example:
% cat test.cuf
program foo
use cudafor
real, device, dimension(:,:), allocatable :: Arr_d
real :: sumval
allocate(Arr_d(64,64))
Arr_d = 1.0
sumval = sum(Arr_d(:,1))
print *, sumval
end program foo
% nvfortran test.cuf ; a.out
64.00000
Hope this helps,
Mat
Following this question, can i directly use SUM in a device kernel? will it be less efficient?
For example, is this code problematic?
attributes(global) subroutine mass_kernel_4(N_TRACKED_SPECIES, N_TOTAL_SCALARS, &
IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, SOLID, RHOS, ZZS )
implicit none
integer*4 :: I, J, K
integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS, N_TRACKED_SPECIES
LOGICAL, intent(in) :: SOLID(0:IBP1,0:JBP1,0:KBP1)
REAL(EB), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: RHOS
REAL(EB), intent(in), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS
!blockIdx%x starts from 1, which is different from cuda c/c++ code
!threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x
!Note that some arrys start from 0, but they are not main arrays
I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z
if(I .ge. 1 .and. I .le. IBAR &
.and. J .ge. 1 .and. J .le. JBAR &
.and. K .ge. 1 .and. K .le. KBAR &
.and. SOLID(I,J,K) .eq. .false. ) then
RHOS(I,J,K) = SUM(ZZS(I,J,K,1:N_TRACKED_SPECIES))
endif
end subroutine mass_kernel_4
Thanks.
Hi Mat, in the bellow code, I can only get the correct answer when ZSS is initialized to zero:
attributes(global) subroutine mass_kernel_5(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, ZZS,tt, tt_part )
implicit none
integer*4 :: I, J, K,N
integer, value :: IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, N_TOTAL_SCALARS
REAL(8), intent(out), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: ZZS
REAL(8), intent(inout), dimension(0:IBP1,0:JBP1,0:KBP1):: tt_part
REAL(8), intent(inout) :: tt
!when define tt as value, the value of tt cannot be passed into the kernel from the calling
!function/subroutine
!REAL(8), value :: tt
!blockIdx%x starts from 1, which is different from cuda c/c++ code
!threadIdx%x also starts from 1, in this case main arrays start from 1 in k, j,i,n dimentions, so we use threadIdx%x
!Note that some arrys start from 0, but they are not main arrays
I = (blockIdx%x-1) * blockDim%x + threadIdx%x
J = (blockIdx%y-1) * blockDim%y + threadIdx%y
K = (blockIdx%z-1) * blockDim%z + threadIdx%z
if( I == 1 .and. J == 1 .and. K == 1 ) print *, "inside kernel, tt=", tt
!tt cannot be used for reduction, we have to define a 3d array tt_part
if( I .le. IBAR .and. J .le. JBAR .and. K .le. KBAR ) then
DO N=1, N_TOTAL_SCALARS
!ZZS(I,J,K,N) = ZZS(I,J,K,N) + tt + N_TOTAL_SCALARS
ZZS(I,J,K,N) = ZZS(I,J,K,N)
ENDDO
!tt=tt+1.0
!tt_part(I,J,K) = I + J + K
tt_part(I,J,K) = sum(ZZS(I,J,K,1:N_TOTAL_SCALARS))
endif
end subroutine mass_kernel_5
program mass_kw_5
!@cuf use cudafor
implicit none
integer, PARAMETER :: IBAR=36, JBAR=24, KBAR=24, IBP1=37, JBP1=25, KBP1=25, N_TOTAL_SCALARS=3
REAL(8), DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS):: ZZS
REAL(8), DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: tt_part
REAL(8), device, DIMENSION(0:IBP1,0:JBP1,0:KBP1,1:N_TOTAL_SCALARS) :: D_ZZS
REAL(8), device, DIMENSION(0:IBP1,0:JBP1,0:KBP1) :: D_tt_part
!REAL(8), dimension(1) :: tt
!REAL(8), device, dimension(1) :: D_tt
REAL(8) :: tt
REAL(8) , device :: D_tt
integer :: I,J,K,istat
integer*4 :: blockDim_x=4, blockDim_y=4, blockDim_z=8
integer*4 :: gridDim_x, gridDim_y, gridDim_z
type(dim3) :: grid, block
!print *, size(tt)
istat=cudaMemset(D_ZZS, 0._8, size(ZZS) )
istat=cudaMemset(D_tt_part, 0._8, size(tt_part) )
!ZZS=0._8
ZZS = -1._8
!ZZS = 1._8
ZZS(1,13,3,1)=0.99576610439851176
ZZS(1,13,3,2)=0.0
ZZS(1,13,3,3)=0.030113580765101414
tt = 0.5
D_tt = tt
D_tt_part = tt_part
D_ZZS = ZZS
gridDim_x = (IBAR+blockDim_x-1)/blockDim_x
gridDim_y = (JBAR+blockDim_y-1)/blockDim_y
gridDim_z = (KBAR+blockDim_z-1)/blockDim_z
grid = dim3(gridDim_x, gridDim_y,gridDim_z)
block= dim3(blockDim_x, blockDim_y,blockDim_z)
print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)
call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, D_tt, D_tt_part )
!call mass_kernel_5<<<grid,block>>>(N_TOTAL_SCALARS, IBAR, JBAR, KBAR, IBP1, JBP1, KBP1, D_ZZS, tt, D_tt_part )
istat = cudaDeviceSynchronize()
ZZS= D_ZZS
tt_part= D_tt_part
tt = sum(tt_part)
print *, ZZS(1,13,3,1), ZZS(1,13,3,2), ZZS(1,13,3,3)
print *, tt, sum(ZZS)
compiled by nvfortran -cuda -g ./test.f90
run by: ./a.out
test# ./a.out
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
inside kernel, tt= 0.5000000000000000
0.5000000000000000
0.9957661032676697 0.000000000000000 3.0113581568002701E-002
-62203.97412031516 -77059.97412031516
I was expecting tt and sum(ZZS) are identical.
Thank you !
Sorry Mat, I got where I was wrong: tt_part and ZZS were not initialized the same value.
Thank you.