About two or more GPUs

Does the compiler support two or more GPUs in the same program?

we find the answer in your website that current release does not include support to automatically control two or more GPUs from the same accelerator region.
Because the system has four GPUs, when will the new release support more GPUs in the same program?

The compiler support more than one GPU just fine. I often run on 32+. The only thing you need to do is set the device in a logical way and use another MP framework on top.

For example, I use MPI to partition work and then, in each MPI process, do some ‘mod 2’ math to make sure process 0 uses gpu 0 and process 1 uses gpu 1, say. You can do this with either CUDA API calls with CUDA Fortran or acc_set_device_num (and associated Runtime Library Routines) with the accelerator pragmas (and with OpenACC too, I think, as of 12.3).

Matt

does not include support to automatically control two or more GPUs from the same accelerator region.

Multiple needs to be performed using a CPU parallel model such as OpenMP or MPI. The complexity of discrete memories makes it impractical for automatic decomposition across multi-GPUs.

I’ve written two articles on Multi-GPU programming that you may find helpful. The first use CUDA Fortran, Multi-GPU Programming Using CUDA Fortran, MPI, and GPUDirect, and the second uses the PGI Accelerator Model, 5x in 5 Hours: Porting a 3D Elastic Wave Simulator to GPUs Using PGI Accelerator . Both programs use MPI since I was targeting Clusters, but I also find using MPI easier when working with multiple GPUs.

  • Mat

Hi,Mat, I have two GPUs in one system,one is Quadro 4000 and the other is Tesla C2050. when I run pgaccelinfo ,I got the information of these two GPUs,but if I want to know which gpus accelerate my code, how can I do ? add -minfo flag?

I use -minfo flag and get the below information:

[zhanghw@localhost openacc]$ pgfortran -o f2a.exe acc_f2a.f90 -acc -Minfo=accel -fast
NOTE: your trial license will expire in 13 days, 13.4 hours.
NOTE: your trial license will expire in 13 days, 13.4 hours.
main:
27, Generating copyin(a(1:n))
Generating copyout(r(1:n))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
28, Loop is parallelizable
Accelerator kernel generated
28, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
CC 1.0 : 12 registers; 56 shared, 112 constant, 28 local memory bytes; 66% occupancy
CC 2.0 : 15 registers; 4 shared, 136 constant, 4 local memory bytes; 100% occupancy

but my two GPUs’ CC shoudld be 2.0 both

Hi Teslalady,

In order to have your program use both GPUs, you will need to use a higher level parallel model, OpenMP or MPI. To select which GPU a binary uses, you must either call the OpenACC runtime routine “acc_set_device_num” from within your program or set the environment flag “ACC_DEVICE_NUM” to the device you wish to use. If neither are set, the default is to use device 0.

but my two GPUs’ CC shoudld be 2.0 both

Since the build can be done on a system different then where it is run, the compiler does not use information about the GPUs attached to build system. Instead, it generates multiple embedded device binaries, in this case for compute capability 1.1 and 2.0. At runtime, the appropriate binary will be used. If you know that this binary will never be run any other devices and don’t want the small amount of size in your binary to store the 1.1 version, then add the flag “-ta=nvidia,2.0” to only target a CC2.0 device.

  • Mat

Hi,

Can somebody please post a sample example of OpenMP + Multi-GPU + CUDA Fortran? I have seen enough examples on MPI, but not any single running example on OpenMP.

Thanks in advance.

Hi Balkrishna,

Below is some code that uses OpenMP+CUDA Fortran. I had originally written it for my article on Multi-GPU programming (https://www.pgroup.com/lit/articles/insider/v3n3a2.htm) but decided to just focus on MPI+CUDA Fortran.

The biggest things to watch out for are your device allocatables since they need to be managed by separate threads. Be sure to put to put them in a “private” clause and don’t allocate them till after the GPU context has been created (i.e. after cudaSetDevice has been called). Also, unless you’re using a Kepler, each OpenMP thread should have their own GPU.

Let me know if you have questions,
Mat

% cat life.F90
#ifdef _CUDA
#define DEVICE device,
#else
#define DEVICE 
#endif

module life_mod
 
    integer :: NXSIZE, NYSIZE, SEED
    parameter(NXSIZE=128, NYSIZE=128, SEED=123)
    real :: initPercent=0.4

contains

#ifdef _CUDA
attributes(global) subroutine life_kernel(dOld, dNew, Nx, Ny)

	implicit none
	integer, dimension(Nx,Ny), device :: dOld, dNew
	integer, value :: Nx,Ny
	integer :: i, j, ix, iy, neighbors, state

	ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
	iy = (blockIdx%y-1)*blockDim%y + threadIdx%y

	if (ix .gt. 1 .and. iy.gt.1 .and. &
            ix .lt. Nx .and. iy .lt. Ny) then
	
            neighbors = dOld(ix,iy-1) + &
	                dOld(ix,iy+1) +	&
	                dOld(ix+1,iy-1) + &	
	                dOld(ix+1,iy+1) + &	
	                dOld(ix-1,iy-1) + &	
	                dOld(ix-1,iy+1) + &	
	                dOld(ix-1,iy) +	&
	                dOld(ix+1,iy)
	state = dOld(ix,iy)
	
	if (state .eq. 0 .and. neighbors .eq. 3) then
	   dNew(ix,iy) = 1  ! birth
	else if (state.eq.1.and.neighbors.ne.2.and.neighbors.ne.3) then
	   dNew(ix,iy) = 0  ! death
	else
	   dNew(ix,iy) = state ! no change
        end if
	end if
	
end subroutine life_kernel

#else
subroutine life(dOld, dNew, Nx, Ny)

	implicit none
	integer, dimension(Nx,Ny) :: dOld, dNew
	integer, value :: Nx,Ny,tid
	integer :: i, j,ix,iy, neighbors, state

        do ix=2,Nx-1
	do iy=2,Ny-1
      
            neighbors = dOld(ix,iy-1) + &
	                dOld(ix,iy+1) +	&
	                dOld(ix+1,iy-1) + &	
	                dOld(ix+1,iy+1) + &	
	                dOld(ix-1,iy-1) + &	
	                dOld(ix-1,iy+1) + &	
	                dOld(ix-1,iy) +	&
	                dOld(ix+1,iy) 

	    state = dOld(ix,iy)

	    if (state .eq. 0 .and. neighbors .eq. 3) then
	      dNew(ix,iy) = 1  ! birth
	    else if (state.gt.0.and.neighbors.ne.2.and.neighbors.ne.3) then
	      dNew(ix,iy) = 0  ! death 0
	    else
              dNew(ix,iy) = state ! no change 
             end if

      end do
   end do

end subroutine life

#endif

subroutine lifeMain ()
	
#ifdef _CUDA
	use cudafor
#endif
#ifdef _OMP
	use omp_lib
#endif
	implicit none

	integer :: Nx, Ny, Nt, temp, count, steps, i, j, angle, nthds, tnum
        integer :: numdev, devnum, istat, myseed, blocksize
	integer, allocatable, DEVICE dimension(:,:) :: dOld, dNew
	integer, allocatable, dimension(:,:) :: A, B
	integer :: start, end
	real, allocatable, dimension(:,:) :: rand
	character(len=80) :: rfile
	integer :: alive
#ifdef _CUDA
    	type(dim3) :: dimGrid, dimBlock
#else
        integer,dimension(3) :: dimGrid,dimBlock
#endif
	! account for the halo
	Nx = NXSIZE+2
	Ny = NYSIZE+2
	myseed = SEED
	allocate(A(Nx,Ny),B(Nx,Ny),rand(Nx,Ny))
	call random_seed(myseed)
	call random_number(rand)
	A=0
	do i=2,Nx-1
	   do j=2,Ny-1
	     if (rand(i,j) < initPercent) then
	        A(i,j) = 1
	     else
                A(i,j) = 0
             endif
           enddo
        enddo
        B=A

	! Set the inital conditions
	count = 0
	steps = 0
	temp = 0
	alive = 1		
#ifdef _CUDA
        istat = cudaGetDeviceCount(numdev)
#endif

!$omp parallel &
!$omp   shared(A,B,count,Nx,Ny,numdev,alive,steps), &
!$omp   private(tnum,dOld,dNew,dimGrid,dimBlock,devnum,istat,Nt, &
!$omp           i,j,blocksize,start,end)

#ifdef _OMP
	nthds = omp_get_num_threads()	
	tnum = omp_get_thread_num()
#else
        nthds = 1
        tnum=0
#endif

!$omp master
	
!$omp end master
!$omp barrier

	blocksize =  NXSIZE/nthds
	Nt = blocksize+2
	start = (tnum*blocksize)+1
	end = start + blocksize + 1

#ifdef _CUDA
        devnum = mod(tnum,numdev)
        istat = cudaSetDevice(devnum)
#endif
   	allocate(dNew(Nt,Ny), dOld(Nt,Ny))	

        dOld=0
	dNew=0
#ifdef _CUDA
	dimBlock = dim3(16,16,1)
        dimGrid = dim3((Nt+15)/16,(ny+15)/16,1)
#endif

	do, while (steps.lt.16384.and.count.lt.10.and.alive.gt.0)

	    dOld(1:Nt,1:Ny) = A(start:end,1:Ny)
            dNew=dOld
#ifdef _CUDA
            call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nt,Ny)
#else
            call life(dOld,dNew,Nt,Ny)
#endif
	    B(start+1:end-1,1:Ny) = dNew(2:Nt-1,1:Ny)
!$omp barrier

!$omp master 
            A=B
	    alive = sum(A)
!            if (mod(steps,100).eq.0) then
            print *, 'Step:', steps, ' Alive:', alive
!            endif
            if (temp.lt.(alive+2).and.temp.gt.(alive-2)) then
               count = count + 1
	    else
	       count = 0
            endif
            temp = alive
	    steps = steps + 1

!$omp end master
!$omp barrier
 	
	end do
	
	deallocate(dOld)
	deallocate(dNew)

!$omp end parallel
  
	alive = sum(A)
	print *, 'Step:', steps, ' Alive:',alive
	print *, 'Init%:', initPercent, ' Actual%:', real(alive)/real(NXSIZE*NYSIZE)
	close(21)
	deallocate(A,B)
		
	stop

end subroutine lifeMain
end module life_mod


program life
use life_mod
implicit none
call lifeMain()
end program life

% pgfortran -Mpreprocess -mp -fast -Minfo=accel -Mcuda life.F90 -o life.out