Hello,
I started looking at the PGI Accelerator programming model only a week ago; I am checking the feasibility of porting a large MPI-parallelized code to the GPU.
As a test, I compare the accelerated loop against the traditional host-executed loop, which uses a Fortran intrinsic function.
Just by adding the !$acc directives I get an enormous speedup; however, the compiler reports that the innermost loop is not parallelizable because of loop-carried reuse of the variable idx_rot.
I post the bulk of the code below; it contains calls to other functions and data I/O that are not relevant to the GPU operations.
module lapo_pixel
integer*4:: nlatzones !Number of rows of the grid
integer*4:: nvox !Total number of voxels
real*4,dimension(:),allocatable:: x_pixel,y_pixel,z_pixel,k_pixel !Voxel coordinates
real*4,dimension(:),allocatable:: x_pixct,y_pixct,z_pixct,k_pixct
real*4,dimension(:),allocatable:: th_pixel,phi_pixel
end module
program gpunight
use lapo_pixel
use accel_lib
implicit none
real*4, dimension(:),allocatable :: sigll,zigll
real*4, dimension(:), allocatable ::x,y,z,phigd,phird,dr
real*4, dimension(:), allocatable :: x_pix,y_pix,z_pix
real*4,dimension(:), allocatable :: xtemp,ytemp,dr_min
integer*4 :: nproc,npts2d,rows,nring_tot,ct,iring,npts3d,ipt
integer*4,dimension(:), allocatable :: idx_rot
real*4 ::dphi,pi,phi,radius,temp,dr2
integer*4 :: c0, c1, c2, c3, cgpu,chost,i
real*4 :: mindist !Running minimum distance; must be real, not integer
pi=3.141592653589793;
nproc=8;
rows=168;
npts2D=rows*nproc;
print*, 'npts2d:',rows*nproc,npts2d,'rows:',rows,'nproc',nproc
allocate(sigll(npts2d),zigll(npts2d))
allocate(xtemp(npts2d),ytemp(npts2d))
call load_grid_bdr(sigll,zigll,dphi,radius,rows,nproc)
nring_tot=ceiling(2*pi/dphi)
npts3d=npts2d*(nring_tot+1)
allocate(x(npts3d),y(npts3d),z(npts3d),dr(npts3d))
ct=0
do iring=0,nring_tot
phi=dphi*iring;
call cyl2cart(xtemp,ytemp,sigll,zigll,phi,rows*4);
x(ct*npts2D+1:(ct+1)*npts2D)=xtemp;
y(ct*npts2D+1:(ct+1)*npts2D)=ytemp;
z(ct*npts2D+1:(ct+1)*npts2D)=zigll;
ct=ct+1
enddo
call pixel_grid(radius)
allocate(x_pix(nvox),y_pix(nvox),z_pix(nvox))
allocate(idx_rot(nvox),dr_min(nvox))
call acc_init( acc_device_nvidia )
x_pix=x_pixct;y_pix=y_pixct;z_pix=z_pixct
print*, 'size x & xpix',size(x),size(x_pixct),npts3d,nvox
call system_clock( count=c1 )
!$acc region
!$acc do private(dr2,mindist)
do ipt=1,nvox
mindist=1.0e15 !Initialise the running minimum to a very large distance
do i=1,npts3D
dr2=sqrt((-x_pix(ipt)+x(i))**2+(-y_pix(ipt)+y(i))**2+(-z_pix(ipt)+z(i))**2)
if (dr2<=mindist) then
mindist=dr2 !Update the running minimum, otherwise every point passes the test
idx_rot(ipt)=i
end if
end do
end do
!$acc end region
call system_clock( count=c2 )
cgpu = c2 - c1
do ipt=1,nvox
dr=sqrt((-x_pixct(ipt)+x)**2+(-y_pixct(ipt)+y)**2+(-z_pixct(ipt)+z)**2);
idx_rot(ipt)=minloc(dr,1)
enddo
call system_clock( count=c3 )
chost = c3 - c2
print *, cgpu, ' microseconds on GPU'
print *, chost, ' microseconds on host'
endprogram gpunight
Compiled with: pgfortran lapo_mesh.f90 -o xlapo.exe -O3 -fastsse -Minline -g -ta=nvidia,time -Minfo
That gives the following output concerning the GPU and host loops:
73, Generating copyin(x(1:(nring_tot+1)*1344))
Generating copyin(y_pix(1:nvox))
Generating copyin(y(1:(nring_tot+1)*1344))
Generating copyin(z_pix(1:nvox))
Generating copyin(z(1:(nring_tot+1)*1344))
Generating copy(idx_rot(1:nvox))
Generating copyin(x_pix(1:nvox))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
75, Loop is parallelizable
77, Loop carried reuse of 'idx_rot' prevents parallelization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
75, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
Using register for 'idx_rot'
Using register for 'x_pix'
Using register for 'y_pix'
Using register for 'z_pix'
77, !$acc do seq(256)
Cached references to size [256] block of 'x'
Cached references to size [256] block of 'y'
Cached references to size [256] block of 'z'
CC 1.0 : 14 registers; 3180 shared, 20 constant, 0 local memory bytes; 66% occupancy
CC 1.3 : 14 registers; 3180 shared, 20 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 24 registers; 3092 shared, 112 constant, 0 local memory bytes; 83% occupancy
90, Loop not vectorized/parallelized: contains call
92, Generated 2 alternate versions of the loop
Generated vector sse code for the loop
Generated 3 prefetch instructions for the loop
The output from the code and the timing analysis are:
73: region entered 1 time
time(us): total=464729 init=1 region=464728
kernels=461062 data=2565
w/o init: total=464728 max=464728 min=464728 avg=464728
77: kernel launched 1 times
grid: [41] block: [256]
time(us): total=461062 max=461062 min=461062 avg=461062
acc_init.c
acc_init
41: region entered 1 time
time(us): init=2387859
464740 microseconds on GPU
37325989 microseconds on host
Already satisfactory, I would say, but I have a few questions:
1- Can the sequential loop inside the accelerated region be further optimised? (See the first sketch below for the kind of restructuring I have in mind.)
2- Would inserting a data region around the compute region improve the data-management performance? (See the second sketch below.)
3- I noticed in a previous post that the minloc/maxloc intrinsics are not available in accelerator regions; are they going to be supported in the near future?
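To make questions 1 and 2 more concrete, these are the (untested) sketches I have in mind; they are meant to drop into the program above. For question 1, I would keep the running minimum and the index of the closest grid point in private scalars (ibest is a new local integer*4 that I would add to the declarations) and write idx_rot(ipt) only once per voxel, though I do not know whether that is enough to remove the loop-carried reuse:

!$acc region
!$acc do private(dr2,mindist,ibest)
do ipt=1,nvox
   mindist=1.0e15     !Running minimum distance
   ibest=1            !Index of the closest grid point found so far
   do i=1,npts3D
      dr2=sqrt((-x_pix(ipt)+x(i))**2+(-y_pix(ipt)+y(i))**2+(-z_pix(ipt)+z(i))**2)
      if (dr2<mindist) then
         mindist=dr2
         ibest=i
      end if
   end do
   idx_rot(ipt)=ibest  !Single store per voxel, after the inner loop
end do
!$acc end region

For question 2, I mean wrapping the compute region in a data region, roughly like this (directive spelling as I understand it from the PGI Accelerator documentation, so please correct me if I have it wrong):

!$acc data region copyin(x,y,z,x_pix,y_pix,z_pix) copyout(idx_rot)
!$acc region
! ... same nearest-point loops as above ...
!$acc end region
!$acc end data region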
Thanks in advance