Is it possible to use both OpenMP + CUDA in PGI fortran ?

yangmh1 · December 17, 2010, 12:56am

Dear All,

My jobs could be described like this,

do i=1,n
! copy an arry from host to device
dA = A(*,i)
! perform calculations on device
call sub1 <<<dim,block>>> (dA)
enddo

Because the device memory can’t save the whole arry A, the calculation should devided into “n” stpes. Because the copying is a bottleneck, I am trying to rewritte the code liking,

! copy a part of data at first
dA1=A(*,1)

do i=1,n

! begin openmp
!omp parallel …
!omp do
do icore=1,ncore
if (icore.eq.1) then
! copy an arry from host to device
if (i.lt.n) then
if (mod(i,2).eq.1) then
dA2 = A(,i+1)
else
dA1 = A(,i+1)
endif
endif
elseif (core.eq.2) then
if (mod(i,2).eq.1) then
call sub1 <<<dim,block>>> (dA1)
else
call sub1 <<<dim,block>>> (dA2)
endif
endif
!omp end do
!omp end parallel
enddo

Here I use two device arrays, and hope to overlap the data transfer and GPU calculation, liking MPI pragraming.

However, the program is failed, I don’t know does PGI fortran could support openmp+cuda ? does anyone has suggestions,

Thanks in advance

MatColgrove · December 17, 2010, 1:20am

Hi yangmh,

Yes, OpenMP can be used with CUDA Fortran and works quite well. Actually my next PGInsider article will be a tutorial on it.

First you must understand that a CPU process/thread can only be attached to a single GPU so in order for a program to use multiple GPUs it must be done from OpenMP, MPI, or other higher parallel layer. Because of this, you need to have the CPU parallel region encompass the CUDA code. Have each process/thread attach to a device and then divide the problem between the devices. In other words, if you have two threads, you’d have two dA1’s each holding half of A.

Here’s some sample code from the OpenMP CUDA Fortran program I wrote for my next article:

subroutine lifeMain ()
	
#ifdef _CUDA
	use cudafor
#endif
	use omp_lib
	implicit none

	integer :: Nx, Ny, Nt, temp, count, steps, i, j, angle, nthds, tnum
        integer :: numdev, devnum, istat, myseed, blocksize
	integer, allocatable, DEVICE dimension(:,:) :: dOld, dNew
	integer, allocatable, dimension(:,:) :: A, B
	integer :: start, end
	real, allocatable, dimension(:,:) :: rand
	character(len=80) :: rfile
	integer :: alive
#ifdef _CUDA
    	type(dim3) :: dimGrid, dimBlock
#else
        integer,dimension(3) :: dimGrid,dimBlock
#endif
	! account for the halo
	Nx = NXSIZE+2
	Ny = NYSIZE+2
	myseed = SEED
	allocate(A(Nx,Ny),B(Nx,Ny),rand(Nx,Ny))
	call random_seed(myseed)
	call random_number(rand)
	A=0
	do i=2,Nx-1
	   do j=2,Ny-1
	     if (rand(i,j) < initPercent) then
	        A(i,j) = 1
	     else
                A(i,j) = 0
             endif
           enddo
        enddo
        B=A

	! Set the initial conditions
	count = 0
	steps = 0
	temp = 0
	alive = 1		
#ifdef _CUDA
        istat = cudaGetDeviceCount(numdev)
#endif

!$omp parallel &
!$omp   shared(A,B,count,Nx,Ny,numdev,alive,steps), &
!$omp   private(tnum,dOld,dNew,dimGrid,dimBlock,devnum,istat,Nt, &
!$omp           i,j,blocksize,start,end)

#ifdef _OMP
	nthds = omp_get_num_threads()	
	tnum = omp_get_thread_num()
#else
        nthds = 1
        tnum=0
#endif

!$omp master
	
!$omp end master
!$omp barrier

	blocksize =  NXSIZE/nthds
	Nt = blocksize+2
	start = (tnum*blocksize)+1
	end = start + blocksize + 1

#ifdef _CUDA
        devnum = mod(tnum,numdev)
        istat = cudaSetDevice(devnum)
#endif
   	allocate(dNew(Nt,Ny), dOld(Nt,Ny))	

        dOld=0
	dNew=0
#ifdef _CUDA
	dimBlock = dim3(16,16,1)
        dimGrid = dim3((Nt+15)/16,(ny+15)/16,1)
#endif

	do, while (steps.lt.16384.and.count.lt.10.and.alive.gt.0)

	    dOld(1:Nt,1:Ny) = A(start:end,1:Ny)
            dNew=dOld
#ifdef _CUDA
            call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nt,Ny)
#else
            call life(dOld,dNew,Nt,Ny)
#endif
	    B(start+1:end-1,1:Ny) = dNew(2:Nt-1,1:Ny)
!$omp barrier

!$omp master 
            A=B
	    alive = sum(A)
!            if (mod(steps,100).eq.0) then
            print *, 'Step:', steps, ' Alive:', alive
!            endif
            if (temp.lt.(alive+2).and.temp.gt.(alive-2)) then
               count = count + 1
	    else
	       count = 0
            endif
            temp = alive
	    steps = steps + 1

!	pause	

!$omp end master
!$omp barrier
 	
	end do
	
	deallocate(dOld)
	deallocate(dNew)
#ifdef _CUDA
	istat=cudaThreadExit()
#endif

!$omp end parallel
  
	alive = sum(A)
	print *, 'Step:', steps, ' Alive:',alive
	print *, 'Init%:', initPercent, ' Actual%:', real(alive)/real(NXSIZE*NYSIZE)
	close(21)
	deallocate(A,B)
		
	stop

end subroutine lifeMain

Hope this helps,
Mat

yangmh1 · December 17, 2010, 6:22am

Dear Mat,

Thansk for your reply. From you code, I understand how to programming with OpenMP and CUDA.

In my work, only one GPU device is used. I plan to use two CPU threads. These two threads attach to the same GPU device. one CPU thread copies data from CPU memory to GPU memory (dA1) and another CPU thread operaters the data have been copied to GPU memory (dA2). In next step, the data copied from CPU memory in saved in GPU memory (dA1) and the GPU calculation is performed with GPU memory (dA1). Something likes MPI_ISEND to overlap the data transfer and the calculations.

I don’t know how to do such a work.

Thanks for your help

Yangmh

MatColgrove · December 17, 2010, 8:16pm

Hi yangmh,

I don’t think it’s currently possible for multiple host threads to share a single device context but I’ll confirm with my contacts at NVIDIA.

See: linux - Can I share cuda GPU device memory between host processes? - Stack Overflow

Mat

yangmh1 · December 18, 2010, 12:30am

Mat,

I see. Thanks very much.

-YangMH

Topic		Replies	Views
About two or more GPUs Legacy PGI Compilers	6	7159	July 31, 2012
CUDA Fortran+Openmp problem Legacy PGI Compilers	9	1149	March 3, 2022
Problem declaring device types inside OpenMP parallel loop. Legacy PGI Compilers	4	8921	April 12, 2010
OpenMP + CUDA Fortran issue Legacy PGI Compilers	1	3181	February 8, 2011
Multi-dimension array allocation problem Legacy PGI Compilers	2	2351	November 30, 2017
OpenMP + OpenACC problem Legacy PGI Compilers	9	5280	April 17, 2019
Device selection issue with OpenMP target mixed with do concurrent/OpenACC nvc, nvc++ and nvfortran	4	467	December 1, 2023
Problem using CUDA Visual Profiler for CUDA Fortran Legacy PGI Compilers	1	5094	August 6, 2012
About PGI Fortran and CUDA 4.0 Legacy PGI Compilers	24	13013	August 12, 2011
muti-gpu via using OpenMP Legacy PGI Compilers	2	2448	January 24, 2014

Is it possible to use both OpenMP + CUDA in PGI fortran ?

Related topics