Is it possible to use both OpenMP + CUDA in PGI fortran ?

Dear All,

My jobs could be described like this,

do i=1,n
! copy an arry from host to device
dA = A(*,i)
! perform calculations on device
call sub1 <<<dim,block>>> (dA)
enddo

Because the device memory can’t save the whole arry A, the calculation should devided into “n” stpes. Because the copying is a bottleneck, I am trying to rewritte the code liking,

! copy a part of data at first
dA1=A(*,1)

do i=1,n

! begin openmp
!omp parallel …
!omp do
do icore=1,ncore
if (icore.eq.1) then
! copy an arry from host to device
if (i.lt.n) then
if (mod(i,2).eq.1) then
dA2 = A(,i+1)
else
dA1 = A(
,i+1)
endif
endif
elseif (core.eq.2) then
if (mod(i,2).eq.1) then
call sub1 <<<dim,block>>> (dA1)
else
call sub1 <<<dim,block>>> (dA2)
endif
endif
!omp end do
!omp end parallel
enddo

Here I use two device arrays, and hope to overlap the data transfer and GPU calculation, liking MPI pragraming.

However, the program is failed, I don’t know does PGI fortran could support openmp+cuda ? does anyone has suggestions,

Thanks in advance

Hi yangmh,

Yes, OpenMP can be used with CUDA Fortran and works quite well. Actually my next PGInsider article will be a tutorial on it.

First you must understand that a CPU process/thread can only be attached to a single GPU so in order for a program to use multiple GPUs it must be done from OpenMP, MPI, or other higher parallel layer. Because of this, you need to have the CPU parallel region encompass the CUDA code. Have each process/thread attach to a device and then divide the problem between the devices. In other words, if you have two threads, you’d have two dA1’s each holding half of A.

Here’s some sample code from the OpenMP CUDA Fortran program I wrote for my next article:

subroutine lifeMain ()
	
#ifdef _CUDA
	use cudafor
#endif
	use omp_lib
	implicit none

	integer :: Nx, Ny, Nt, temp, count, steps, i, j, angle, nthds, tnum
        integer :: numdev, devnum, istat, myseed, blocksize
	integer, allocatable, DEVICE dimension(:,:) :: dOld, dNew
	integer, allocatable, dimension(:,:) :: A, B
	integer :: start, end
	real, allocatable, dimension(:,:) :: rand
	character(len=80) :: rfile
	integer :: alive
#ifdef _CUDA
    	type(dim3) :: dimGrid, dimBlock
#else
        integer,dimension(3) :: dimGrid,dimBlock
#endif
	! account for the halo
	Nx = NXSIZE+2
	Ny = NYSIZE+2
	myseed = SEED
	allocate(A(Nx,Ny),B(Nx,Ny),rand(Nx,Ny))
	call random_seed(myseed)
	call random_number(rand)
	A=0
	do i=2,Nx-1
	   do j=2,Ny-1
	     if (rand(i,j) < initPercent) then
	        A(i,j) = 1
	     else
                A(i,j) = 0
             endif
           enddo
        enddo
        B=A

	! Set the initial conditions
	count = 0
	steps = 0
	temp = 0
	alive = 1		
#ifdef _CUDA
        istat = cudaGetDeviceCount(numdev)
#endif

!$omp parallel &
!$omp   shared(A,B,count,Nx,Ny,numdev,alive,steps), &
!$omp   private(tnum,dOld,dNew,dimGrid,dimBlock,devnum,istat,Nt, &
!$omp           i,j,blocksize,start,end)

#ifdef _OMP
	nthds = omp_get_num_threads()	
	tnum = omp_get_thread_num()
#else
        nthds = 1
        tnum=0
#endif

!$omp master
	
!$omp end master
!$omp barrier

	blocksize =  NXSIZE/nthds
	Nt = blocksize+2
	start = (tnum*blocksize)+1
	end = start + blocksize + 1

#ifdef _CUDA
        devnum = mod(tnum,numdev)
        istat = cudaSetDevice(devnum)
#endif
   	allocate(dNew(Nt,Ny), dOld(Nt,Ny))	

        dOld=0
	dNew=0
#ifdef _CUDA
	dimBlock = dim3(16,16,1)
        dimGrid = dim3((Nt+15)/16,(ny+15)/16,1)
#endif

	do, while (steps.lt.16384.and.count.lt.10.and.alive.gt.0)

	    dOld(1:Nt,1:Ny) = A(start:end,1:Ny)
            dNew=dOld
#ifdef _CUDA
            call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nt,Ny)
#else
            call life(dOld,dNew,Nt,Ny)
#endif
	    B(start+1:end-1,1:Ny) = dNew(2:Nt-1,1:Ny)
!$omp barrier

!$omp master 
            A=B
	    alive = sum(A)
!            if (mod(steps,100).eq.0) then
            print *, 'Step:', steps, ' Alive:', alive
!            endif
            if (temp.lt.(alive+2).and.temp.gt.(alive-2)) then
               count = count + 1
	    else
	       count = 0
            endif
            temp = alive
	    steps = steps + 1

!	pause	

!$omp end master
!$omp barrier
 	
	end do
	
	deallocate(dOld)
	deallocate(dNew)
#ifdef _CUDA
	istat=cudaThreadExit()
#endif

!$omp end parallel
  
	alive = sum(A)
	print *, 'Step:', steps, ' Alive:',alive
	print *, 'Init%:', initPercent, ' Actual%:', real(alive)/real(NXSIZE*NYSIZE)
	close(21)
	deallocate(A,B)
		
	stop

end subroutine lifeMain

Hope this helps,
Mat

Dear Mat,

Thansk for your reply. From you code, I understand how to programming with OpenMP and CUDA.

In my work, only one GPU device is used. I plan to use two CPU threads. These two threads attach to the same GPU device. one CPU thread copies data from CPU memory to GPU memory (dA1) and another CPU thread operaters the data have been copied to GPU memory (dA2). In next step, the data copied from CPU memory in saved in GPU memory (dA1) and the GPU calculation is performed with GPU memory (dA1). Something likes MPI_ISEND to overlap the data transfer and the calculations.

I don’t know how to do such a work.

Thanks for your help

Yangmh

Hi yangmh,

I don’t think it’s currently possible for multiple host threads to share a single device context but I’ll confirm with my contacts at NVIDIA.

See: http://stackoverflow.com/questions/3300605/can-i-share-cuda-gpu-device-memory-between-host-processes

  • Mat

Mat,

I see. Thanks very much.

-YangMH