Fast way to fill a GPU array with 0's

Hi, I wonder if there exists a fast intrinsic function in CUDA Fortran to fill an array with 0's. Right now I am using the following simple kernel to fill an array with zeros:

attributes(global) subroutine zeros_kernel(zeros, n1, n2, n3)
       real*8, device :: zeros(0:n1,0:n2,0:n3)
       integer, value :: n1, n2, n3
       integer :: i, j, kb, tx, ty, tz

! Start execution, first get my thread indices

       tx = threadidx%x - 1
       ty = threadidx%y - 1
       tz = threadidx%z - 1

! Thread indices over the grid

       i = (blockidx%x-1) * blockDim%x + tx
       j = (blockidx%y-1) * blockDim%y + ty

! Each thread strides through the k planes; the stride of 2 assumes blockDim%z = 2

       if (i .le. n1 .and. j .le. n2) then
          do kb = 0, n3-tz, 2
             zeros(i,j,kb+tz) = 0.d0
          end do
       end if

    end subroutine zeros_kernel
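
The kernel is launched with something like the following (a sketch only, since block sizes may vary; the key assumption is blockDim%z = 2 so the stride-2 kb loop covers every k plane, and that zeros_kernel lives in a module that is use-associated on the host side):

    subroutine launch_zeros(zeros_d, n1, n2, n3)
       use cudafor
       implicit none
       integer, intent(in) :: n1, n2, n3
       real*8, device :: zeros_d(0:n1, 0:n2, 0:n3)
       type(dim3) :: grid, block

       ! Illustrative block shape; blockDim%z = 2 matches the kernel's kb loop
       block = dim3(16, 16, 2)
       ! Enough blocks to cover the 0:n1 x 0:n2 extents (n1+1 by n2+1 elements)
       grid  = dim3((n1 + block%x)/block%x, (n2 + block%y)/block%y, 1)
       call zeros_kernel<<<grid, block>>>(zeros_d, n1, n2, n3)
    end subroutine launch_zeros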

Is there a faster way to fill the array with 0’s? Any ideas are welcome.

Hi,

I know this is an old post but I am experiencing a similar issue. My kernel time is being dominated by the code that sets a relatively large array to contain only zero values. I have been unable to find a way to bypass the need for the zeroed array. The problem is significant due to the large number of calls to the kernel.

I created the dummy program below with a variety of loop schedules and have managed to reduce the time relative to the intrinsic whole-array assignment. However, I am unhappy with hard-coding these values, since the array size and hardware may change.
I also think further improvements must be possible; any suggestions on that front would be gratefully received.

I understand that the Thrust library contains a function for problems such as this. Could anybody point me towards some sample code in which a call to the Thrust library is implemented within either the Accelerator compiler or CUDA Fortran?

Obviously it would be nice if a speedy intrinsic function was available for this problem!

Cheers,

Karl

Timing results:
Implicit 1.428203
Dumb 2.538779
Explicit 1 1.428061
Explicit 2 1.205521
Explicit 3 1.181079
Explicit 4 1.177717
Explicit 5 1.174500
Explicit 6 1.174441

program zeroarray
     use accel_lib
     implicit none
     real, dimension(:,:,:), allocatable :: test_array
     real tima, timb
     real time_implicit
     real time_explicit_0, time_explicit_1, time_explicit_2, time_explicit_3
     real time_explicit_4, time_explicit_5, time_explicit_6, time_explicit_7
     integer it, maxit, i, j, k
     integer arraysize

     maxit = 50
     arraysize = 800

     call acc_init(acc_device_nvidia)
     write(*,*) 'Tests for a', arraysize, '*', arraysize, '*', arraysize, 'array'
     write(*,*) 'Array size in memory (MB):', 4*((arraysize**3)/(1024*1024))
     write(*,*) ' '
     allocate(test_array(arraysize,arraysize,arraysize))


!$acc data region local(test_array)

! ====== IMPLICIT METHOD ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
           test_array = 0.0
!$acc end region
     end do
     call cpu_time(timb)
     time_implicit = timb-tima

! ====== DUMB METHOD ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
!$acc do
        do i=1, arraysize
           do j=1, arraysize
              do k=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_0 = timb-tima

! ====== EXPLICIT METHOD TUNED 1 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
!$acc do
        do k=1, arraysize
           do j=1, arraysize
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_1 = timb-tima

! ====== EXPLICIT METHOD TUNED 2 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region

!$acc do parallel, vector(4)
        do k=1, arraysize
!$acc do parallel, vector(4)
           do j=1, arraysize
!$acc do vector(32)
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_2 = timb-tima

! ====== EXPLICIT METHOD TUNED 3 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
!$acc do parallel, vector(8)
        do k=1, arraysize
!$acc do parallel, vector(4)
           do j=1, arraysize
!$acc do vector(32)
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_3 = timb-tima

! ====== EXPLICIT METHOD TUNED 4 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region

!$acc do
        do k=1, arraysize
!$acc do parallel, vector(14)
           do j=1, arraysize
!$acc do vector(32)
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_4 = timb-tima


! ====== EXPLICIT METHOD TUNED 5 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
!$acc do parallel
        do k=1, arraysize
!$acc do parallel vector(16)
           do j=1, arraysize
!$acc do vector(32)
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_5 = timb-tima

! ====== EXPLICIT METHOD TUNED 6 ======
     call cpu_time(tima)
     do it = 1 , maxit
!$acc region
!$acc do
        do k=1, arraysize
!$acc do parallel vector(16)
           do j=1, arraysize
!$acc do vector(32)
              do i=1, arraysize
                 test_array(i,j,k) = 0.0
              end do
           end do
        end do
!$acc end region
     end do
     call cpu_time(timb)
     time_explicit_6 = timb-tima



!$acc end data region
     deallocate(test_array)
     write(*,*) 'Timing results:'
     write(*,*) '   Implicit  ', time_implicit
     write(*,*) '   Dumb      ', time_explicit_0
     write(*,*) '   Explicit 1', time_explicit_1
     write(*,*) '   Explicit 2', time_explicit_2
     write(*,*) '   Explicit 3', time_explicit_3
     write(*,*) '   Explicit 4', time_explicit_4
     write(*,*) '   Explicit 5', time_explicit_5
     write(*,*) '   Explicit 6', time_explicit_6

!!!!![kaw2e11@UOS-205126 zeroarray]$ pgaccelinfo
!!!!!CUDA Driver Version:           4000
!!!!!NVRM version: NVIDIA UNIX x86_64 Kernel Module  275.21  Mon Jul 18 14:40:18 PDT 2011
!!!!!Device Name:                   Tesla C2075
!!!!!Device Revision Number:        2.0
!!!!!Number of Multiprocessors:     14
!!!!!Number of Cores:               448
!!!!!Warp Size:                     32

end program zeroarray

Hi Karl,

I understand that the Thrust library contains a function for problems such as this. Could anybody point me towards some sample code in which a call to the Thrust library is implemented within either the Accelerator compiler or CUDA Fortran?

I’m skeptical thrust could help here. Besides having to use CUDA Fortran as the interface, you’d add the overhead of calling thrust. For an example of using Thrust with CUDA Fortran, please see: CUDA Musing: Calling Thrust from CUDA Fortran
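
For what it's worth, the pattern in that post boils down to a thin C wrapper around the Thrust call, compiled with nvcc, plus an iso_c_binding interface on the Fortran side with a device dummy argument. A minimal sketch of the Fortran side only (the wrapper name "thrust_fill_float" and its argument list are illustrative assumptions, not the post's exact code):

    module thrust_fill_m
       use iso_c_binding
       interface
          ! Binds to a hand-written C wrapper around thrust::fill, built with nvcc
          subroutine thrust_fill_float(array, n, val) bind(C, name="thrust_fill_float")
             use iso_c_binding
             real(c_float), device :: array(*)   ! raw device pointer handed to Thrust
             integer(c_int), value :: n          ! number of elements
             real(c_float), value  :: val        ! fill value
          end subroutine thrust_fill_float
       end interface
    end module thrust_fill_m

With that in place the call would just be thrust_fill_float(test_arrayD, size(test_arrayD), 0.0), though, as above, I doubt it would beat a plain memset for zero-filling.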

I think the best solution would be for the compiler to use idiom recognition: instead of creating an implied do loop for “test_array = 0.0” which then gets accelerated, it would generate a call to cudaMemset. This is what happens when test_array is a CUDA Fortran device array.

I have a similar request in for C already, but for your specific example I added TPR#18614. Though, using cudaMemset only gives a ~30% speed-up, so the gain is not huge, but it is a gain.

For example:

 % cat memset.f90
program zeroarray
     use accel_lib
     use cudafor
     implicit none
     real, dimension(:,:,:), allocatable :: test_array
     real, dimension(:,:,:), allocatable,device :: test_arrayD
     integer it, maxit
     integer arraysize

     maxit = 50
     arraysize = 800

     call acc_init(acc_device_nvidia)
     allocate(test_array(arraysize,arraysize,arraysize))
     allocate(test_arrayD(arraysize,arraysize,arraysize))

!$acc data region local(test_array)
     do it = 1 , maxit
!$acc region
         test_array = 0.0
!$acc end region
     end do
!$acc end data region

     do it = 1 , maxit
          test_arrayD=0.0
      end do

      deallocate(test_array)
      deallocate(test_arrayD)

end program zeroarray

% pgf90 -ta=nvidia memset.f90 -V12.3 -fast -Minfo -Mcuda -o memset.out
zeroarray:
17, Generating local(test_array(:,:,:))
18, Loop not vectorized/parallelized: contains call
20, Loop is parallelizable
Accelerator kernel generated
20, !$acc do vector(16) ! threadidx%x
!$acc do parallel, vector(4) ! blockidx%x threadidx%y
!$acc do parallel, vector(4) ! blockidx%y threadidx%z
25, Loop not vectorized/parallelized: contains call
% setenv CUDA_PROFILE 1
% memset.out
% perl totalProf.pl cuda_profile_0.log
                       GPU TIME 
__pgi_dev_cumemset_4    0.000680 
__pgi_dev_cumemset_4f   0.863719 
zeroarray_20_gpu        1.241995 
Totals:                 2.106394

Note that “totalProf.pl” is just a simple Perl script I wrote to sum up the values in a CUDA profile log.
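
If you want the memset path today without waiting on the TPR, you can also keep the array as a CUDA Fortran device array and call cudaMemset on it yourself. A minimal sketch, assuming the generic cudaMemset interface from the cudafor module (device array, value of matching type, element count):

    program memset_direct
         use cudafor
         implicit none
         real, dimension(:,:,:), allocatable, device :: test_arrayD
         integer :: istat, arraysize

         arraysize = 800
         allocate(test_arrayD(arraysize,arraysize,arraysize))

    ! Explicit memset over all elements; does the same job as test_arrayD = 0.0
         istat = cudaMemset(test_arrayD, 0.0, size(test_arrayD))
         if (istat /= 0) print *, 'cudaMemset returned ', istat

         deallocate(test_arrayD)
    end program memset_direct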

  • Mat

Hi Mat,

Thank you for that, it works nicely and also explains some niggling differences in performance I was seeing elsewhere when I was comparing a pure Accelerator version of a routine with a CUDA Fortran version.

Cheers,

Karl