Hi,
I know this is an old post, but I am experiencing a similar issue. My kernel time is dominated by the code that zeroes a relatively large array, and I have been unable to find a way to avoid needing the zeroed array. The problem is significant because the kernel is called a large number of times.
I created the following dummy program to try a variety of loop schedules and have managed to reduce the time relative to the implicit array assignment (test_array = 0.0). However, I am unhappy with hard-coding the schedule values, since the best choice is likely to change with array size and hardware.
I also think further improvements must be possible; any suggestions on that front would be gratefully received.
I understand that the Thrust library contains a function for problems such as this. Could anybody point me towards some sample code in which the Thrust library is called from either the Accelerator compiler or CUDA Fortran?
Obviously it would be nice if a speedy intrinsic function was available for this problem!
Cheers,
Karl
Timing results (seconds, total for 50 iterations):
Implicit 1.428203
Dumb 2.538779
Explicit 1 1.428061
Explicit 2 1.205521
Explicit 3 1.181079
Explicit 4 1.177717
Explicit 5 1.174500
Explicit 6 1.174441
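As a rough sanity check on how much headroom remains, the timings above can be turned into an effective write bandwidth. The sketch below is plain Fortran with no accelerator directives; it assumes default 4-byte reals and takes the array size, iteration count and best time from the test. The program name and variable names are only illustrative.

```fortran
program zero_bandwidth
implicit none
! Assumptions: 800^3 default (4-byte) reals, 50 passes, and the best
! measured time (Explicit 6) taken from the results listed above.
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: arraysize = 800, maxit = 50
real(dp), parameter :: bytes_per_real = 4.0_dp
real(dp) :: bytes_per_pass, seconds, gb_per_s

seconds = 1.174441_dp
! Bytes written by one pass over the whole array
bytes_per_pass = bytes_per_real * real(arraysize, dp)**3
! Total bytes over all passes, divided by elapsed time, in GB/s (1 GB = 1e9 bytes)
gb_per_s = bytes_per_pass * maxit / seconds / 1.0e9_dp

write(*,'(a,f8.1)') 'MB written per pass:       ', bytes_per_pass / 1.0e6_dp
write(*,'(a,f8.1)') 'Effective bandwidth (GB/s):', gb_per_s
end program zero_bandwidth
```

With these numbers each pass writes 2048 MB and the best schedule works out to roughly 87 GB/s. If I have the spec sheet figure right (about 144 GB/s peak for the Tesla C2075, less with ECC enabled), that gives an idea of how much could still be gained from scheduling alone.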
program zeroarray
use accel_lib
implicit none
real, dimension(:,:,:), allocatable :: test_array
real tima, timb
real time_implicit
real time_explicit_0, time_explicit_1, time_explicit_2, time_explicit_3
real time_explicit_4, time_explicit_5, time_explicit_6
integer it, maxit, i, j, k
integer arraysize
maxit = 50
arraysize = 800
call acc_init(acc_device_nvidia)
write(*,*) 'Tests for a', arraysize, '*', arraysize, '*', arraysize, 'array'
! Default reals are 4 bytes; compute in floating point to avoid integer truncation
write(*,*) 'Array size in memory (MB):', 4.0*real(arraysize)**3/(1024.0*1024.0)
write(*,*) ' '
allocate(test_array(arraysize,arraysize,arraysize))
!$acc data region local(test_array)
! ====== IMPLICIT METHOD ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
test_array = 0.0
!$acc end region
end do
call cpu_time(timb)
time_implicit = timb-tima
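! ====== DUMB METHOD ======
! Loop order i, j, k puts the slowest-varying index (k) innermost, so
! consecutive iterations touch memory with a large stride (Fortran is column-major).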
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do
do i=1, arraysize
do j=1, arraysize
do k=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_0 = timb-tima
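! ====== EXPLICIT METHOD TUNED 1 ======
! Loop order k, j, i puts the fastest-varying index (i) innermost, giving
! contiguous memory access; the schedule is left to the compiler.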
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do
do k=1, arraysize
do j=1, arraysize
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_1 = timb-tima
! ====== EXPLICIT METHOD TUNED 2 ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do parallel, vector(4)
do k=1, arraysize
!$acc do parallel, vector(4)
do j=1, arraysize
!$acc do vector(32)
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_2 = timb-tima
! ====== EXPLICIT METHOD TUNED 3 ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do parallel, vector(8)
do k=1, arraysize
!$acc do parallel, vector(4)
do j=1, arraysize
!$acc do vector(32)
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_3 = timb-tima
! ====== EXPLICIT METHOD TUNED 4 ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do
do k=1, arraysize
!$acc do parallel, vector(14)
do j=1, arraysize
!$acc do vector(32)
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_4 = timb-tima
! ====== EXPLICIT METHOD TUNED 5 ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do parallel
do k=1, arraysize
!$acc do parallel vector(16)
do j=1, arraysize
!$acc do vector(32)
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_5 = timb-tima
! ====== EXPLICIT METHOD TUNED 6 ======
call cpu_time(tima)
do it = 1 , maxit
!$acc region
!$acc do
do k=1, arraysize
!$acc do parallel vector(16)
do j=1, arraysize
!$acc do vector(32)
do i=1, arraysize
test_array(i,j,k) = 0.0
end do
end do
end do
!$acc end region
end do
call cpu_time(timb)
time_explicit_6 = timb-tima
!$acc end data region
deallocate(test_array)
write(*,*) 'Timing results:'
write(*,*) ' Implicit ', time_implicit
write(*,*) ' Dumb ', time_explicit_0
write(*,*) ' Explicit 1', time_explicit_1
write(*,*) ' Explicit 2', time_explicit_2
write(*,*) ' Explicit 3', time_explicit_3
write(*,*) ' Explicit 4', time_explicit_4
write(*,*) ' Explicit 5', time_explicit_5
write(*,*) ' Explicit 6', time_explicit_6
!!!!![kaw2e11@UOS-205126 zeroarray]$ pgaccelinfo
!!!!!CUDA Driver Version: 4000
!!!!!NVRM version: NVIDIA UNIX x86_64 Kernel Module 275.21 Mon Jul 18 14:40:18 PDT 2011
!!!!!Device Name: Tesla C2075
!!!!!Device Revision Number: 2.0
!!!!!Number of Multiprocessors: 14
!!!!!Number of Cores: 448
!!!!!Warp Size: 32
end program zeroarray