Using multiple GPUs

Hi,
I’m trying to run different code simultaneously on 2 GPUs. I’m under the impression that this requires OpenMP but I can’t seem to get the code to work.
pgaccelinfo is picking up both devices and I am using the most recent CUDA drivers, etc.

The code at the bottom of the post results in the following messages at run time and then freezes:

call gpu code
number of threads: 2
Section 1, thread: 0
test 1
number of threads: 2
Section 2, thread: 1
no devices found, exiting
launch kernel file=gpu_xyzint_1_openmptest.f90 function=gpu_xyzint_1 line=969 grid=1 block=15

!$OMP PARALLEL SHARED(pint,qint,rint)
      tid = OMP_GET_THREAD_NUM()
      if (tid.eq.0) then
         nthreads = OMP_GET_NUM_THREADS()
      end if
      print *, 'number of threads:', nthreads
!$OMP SECTIONS
!$OMP SECTION
      print *, 'Section 1, thread:', OMP_GET_THREAD_NUM()
      print *, 'test 1'
      call acc_set_device_num(0,acc_device_default)
!$acc region
!$acc do
      do i=1,15
         pint(i) = 0
         qint(i) = 0
         rint(i) = 0
      end do
!$acc end region
!     call gpucode(ngpu,lgpu)
!$OMP SECTION
      print *, 'Section 2, thread:', OMP_GET_THREAD_NUM()
!$acc region
      call acc_set_device_num(1,acc_device_default)
!$acc do
      do i=16,31
         pint(i) = 0
         qint(i) = 0
         rint(i) = 0
      end do
!$acc end region
!     call gpucode(ngpu,lgpu)
!$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL

As you can see, I have replaced the call to a separate accelerated subroutine with some simple code. Would the call work when used in this way?


Many thanks,

Karl

Hi Karl,

Unfortunately, support for using accelerator regions within OpenMP regions is not in 9.0 yet (see the FAQ entry on Multiple GPU Support). We are actively working on adding this and, if all goes well, expect preliminary support in September’s 9.0-4 monthly release.

That said, I’m not sure where the “no devices found, exiting” error is coming from. I worked up a small test case from your sample but get the error “libcuda.so not found, exiting” instead. You’re welcome to send me the code and I can see what’s going on.
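
Once that support is in, the main change to your code will be to select the device in each section before entering its accelerator region (in your Section 2 the acc_set_device_num call currently sits inside the region). Roughly like this — an untested sketch until 9.0-4 is out, using acc_device_nvidia to be explicit about the device type and making the loop index private:

!$OMP PARALLEL SHARED(pint,qint,rint) PRIVATE(i)
!$OMP SECTIONS
!$OMP SECTION
      ! pick device 0 on this thread before its accelerator region
      call acc_set_device_num(0,acc_device_nvidia)
!$acc region
!$acc do
      do i=1,15
         pint(i) = 0
         qint(i) = 0
         rint(i) = 0
      end do
!$acc end region
!$OMP SECTION
      ! pick device 1 on this thread before its accelerator region
      call acc_set_device_num(1,acc_device_nvidia)
!$acc region
!$acc do
      do i=16,31
         pint(i) = 0
         qint(i) = 0
         rint(i) = 0
      end do
!$acc end region
!$OMP END SECTIONS
!$OMP END PARALLEL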

Thanks,
Mat

Hi Mat,

I was initially getting the libcuda.so error, but it went away when I installed the latest CUDA version. I thought that issue arose from the installation of the second GPU, though.

I’ve emailed the code to you.

Many thanks,

Karl

edit:
Is there any other way to run different !$acc regions simultaneously on different GPUs?
Also, are there examples anywhere on running a normal region on multiple GPUs?

Cheers!

Hi Karlw,

Is there any other way to run different !$acc regions simultaneously on different GPUs?

You can use MPI.

Also, are there examples anywhere on running a normal region on multiple GPUs?

Besides MPI (and, in the future, OpenMP or pthreads), we do not support dividing a single accelerator region across multiple devices. This is an evolving model, though, so it may become possible later.

  • Mat

Okay, Mat, now I have a question: how does one use MPI and !$acc together?

I currently have a big, big MPI program, and I’m thinking of accelerating a small part of it, way down in the code tree, that accounts for 25-30% of the CPU time (and it should be fairly CUDA-friendly: no intercommunication, etc.).

The CUDA testbed I’m using has 4 CPUs and a Tesla S1060 (= 4 GPUs), so I have a nice one-to-one ratio. If I used the accelerator pragmas and ran this with mpirun -np 4, would it “automagically” have rank n use GPU n, or do I need to add additional logic to the code?

I’m assuming the latter, and so is there an example PGI has that shows how to do that? (Of course, I’m hoping for the former!)

Matt

Hi Matt,

I went ahead and wrote a small matmul program using MPI and accelerator regions. The main thing to note is that each process should set the device number to avoid too much contention (though processes can share a device).

      
#ifdef _ACCEL
      numdevices = acc_get_num_devices(acc_device_nvidia)
      mydevice = mod(myRank,numdevices)
      call acc_set_device_num(mydevice,acc_device_nvidia)
#endif

Here’s the full source. Please let me know if you spot any bugs since I haven’t done MPI programming since grad school. ;-)

  • Mat
% cat mpi_mm_acc.f90                                                                 
! Simple matmul example

    module mymm
    contains

    subroutine mm_acc( a, b, c, m, start, end)
     implicit none
     real, dimension(:,:), intent(out) :: a
     real, dimension(:,:), intent(in) :: b,c
     integer m, start, end, i,j,k

!$acc region  copyin(b(:m,:m), c(:m,:m)),       &
!$acc& copyout(a(start:end,:m))

     do j = 1, m
        do i = start,end
            a(i,j) = 0.0
        enddo

       do i = start,end
         do k = 1,m
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
         enddo
       enddo
     enddo
!$acc end region

    end subroutine

    end module


      PROGRAM mpi_matmul_acc
        use mymm
        use mpi
#ifdef _ACCEL
        use accel_lib
#endif
      IMPLICIT NONE
      INTEGER :: myRank, ierr, status(MPI_STATUS_SIZE)
      INTEGER :: count, dest, source, tag, size, icode
      INTEGER :: nn, chunk, start, end, a_chunk_type
      INTEGER :: i, j, n, mydevice, numdevices
      REAL,allocatable :: a(:,:)
      REAL,allocatable :: b(:,:), c(:,:)

! set-up MPI
      CALL MPI_INIT(ierr)
        CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )
        CALL MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )

! have task 0 get the size of the array from the user
      IF (myRank.EQ.0) THEN
        print *, 'enter array size to run'
        read *, nn
      END IF

! send the size to the processes
      call MPI_BCAST(nn,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)

! set the number of columns each process should work on
      chunk = nn / size

#ifdef _ACCEL
! set the device for each process
      numdevices = acc_get_num_devices(acc_device_nvidia)
      mydevice = mod(myRank,numdevices)
      call acc_set_device_num(mydevice,acc_device_nvidia)
#endif

! set-up the vector type for our arrays
      CALL MPI_TYPE_VECTOR(1, nn, nn, MPI_REAL, a_chunk_type, ierr)
      CALL MPI_TYPE_COMMIT(a_chunk_type,ierr)

! allocate our arrays
      allocate(a(nn,nn))
      allocate(b(nn,nn))
      allocate(c(nn,nn))
      tag=99
      a=0

! have task 0 initialize the b and c arrays with random numbers
      IF (myRank.EQ.0) THEN
         call RANDOM_NUMBER(b)
         call RANDOM_NUMBER(c)
      end if

! sync the b and c arrays
      call MPI_BCAST(b,nn*nn,MPI_REAL,0,MPI_COMM_WORLD,ierr)
      call MPI_BCAST(c,nn*nn,MPI_REAL,0,MPI_COMM_WORLD,ierr)

! determine the starting column
      start = (myRank*chunk)+1

! call the MM routine
      call mm_acc( a, b, c, nn, start, start+chunk-1 )

! Have process 0 Gather the results
      IF (size .gt. 1 .and. myRank.EQ.0) THEN
         do source=1,size-1
            start = (source*chunk)+1
            ! each rank computed rows start .. start+chunk-1
            do i=start,start+chunk-1
               CALL MPI_RECV(a(i,:), 1, a_chunk_type, source, tag,  &
                    MPI_COMM_WORLD, status, ierr)
            end do

         end do

      END IF

!have the other processes send their data to process 0
      IF (size .gt. 1 .and. myRank.ne.0) THEN
         dest = 0
         do i=start,start+chunk-1
            CALL MPI_SEND(a(i,:), 1, a_chunk_type, dest, tag,  &
                 MPI_COMM_WORLD, ierr)
         end do

      END IF

! have task 0 print out a few of the values to make sure it got the results back
      IF (myRank.EQ.0) THEN
        do i=1,size
            PRINT*, "TASK #", myRank, " ", a((i*chunk)-1,34)
        end do
      end if

! Clean-up
      deallocate(a)
      deallocate(b)
      deallocate(c)

      CALL MPI_FINALIZE(ierr)

      END
% mpif90 -Mpreprocess -o mmacc.out mpi_mm_acc.f90 -fastsse -Minfo=accel -V9.0-3 -ta=nvidia
mm_acc:
     12, Generating copyout(a(start:end,:m))
         Generating copyin(c(:m,:m))
         Generating copyin(b(:m,:m))
     15, Loop is parallelizable
     16, Loop is parallelizable
         Accelerator kernel generated
         15, !$acc do parallel, vector(16)
         16, !$acc do parallel, vector(16)
     20, Loop is parallelizable
     21, Complex loop carried dependence of a prevents parallelization
         Loop carried dependence of a prevents parallelization
         Loop carried backward dependence of a prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         15, !$acc do parallel, vector(16)
         20, !$acc do parallel, vector(16)
             Using register for 'a'
         21, !$acc do seq
             Cached references to size [16x16] block of 'b'
             Cached references to size [16x16] block of 'c'
% mpirun -np 4 mmacc.out                                                              
enter array size to run
1024
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=16 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=21 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=16 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=16 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=21 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=21 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=16 grid=64x16 block=16x16
launch kernel  file=mpi_mm_acc.f90 function=mm_acc line=21 grid=64x16 block=16x16
 TASK #            0      264.2788
 TASK #            0      267.8382
 TASK #            0      268.5576
 TASK #            0      263.5367

Thanks! This is exactly what I was hoping for: a “transparent” sort of hook. I’ll be sure to let you know when I get a chance to try it out! (At the moment, I’m aiming to try one CPU and one GPU. Get that going first!)

One more (of many, to be sure) question: if you had 8 CPUs using 4 GPUs, would you incur initialization penalties each time you cycled through the mod? That is, if it turned out that rank 0 asked for the GPU 0, then rank 4, if rank 0 then asks for it again, do you incur the reinitialize time again?

Matt

One more (of many, to be sure) question: if you had 8 CPUs using 4 GPUs, would you incur initialization penalties each time you cycled through the mod? That is, if it turned out that rank 0 asked for the GPU 0, then rank 4, if rank 0 then asks for it again, do you incur the reinitialize time again?

I don’t believe so. Once a device is initialized, it doesn’t need to be initialized again, even if it was initialized by another program.
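
If you want to take that one-time hit up front, rather than on the first accelerator region a rank runs, you can add an explicit acc_init call after selecting the device — a minimal tweak to the earlier snippet (the acc_init call is optional):

#ifdef _ACCEL
! select the device for this rank, then initialize it up front so the
! one-time startup cost is not charged to the first accelerator region
      numdevices = acc_get_num_devices(acc_device_nvidia)
      mydevice = mod(myRank,numdevices)
      call acc_set_device_num(mydevice,acc_device_nvidia)
      call acc_init(acc_device_nvidia)
#endif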

  • Mat