About PGI Fortran and CUDA 4.0

Can PGI Fortran do the multi-GPU computing without by using OpenMP in one workstation as the CUDA Toolkit 4.0?

Hi addison827,

Once CUDA 4.0 is officially released (currently only a pre-release version is available) we will add it to PGI’s installation shortly after.

While I haven’t used CUDA 4.0, it’s my understanding that it will allow you to have a single context use multiple GPUs. You will need to manage all data but will be allowed copy data directly between devices. I believe that CUDA functions were simply overloaded to add this support so they should ‘just work’ with CUDA Fortran. However, the PGI Accelerator will need modifications in order to support multiple GPUs and may be awhile before this support is added.

Hope this helps,
Mat

According to the release news, the Fortran 11.6 supports CUDA 4.0. But, how to use the peer-to-peer communication? Should i make any changes in my program in order to use several GPUs without arranging OpenMP communication?

Being totally new at CUDA, let me ask a (maybe) stupid question. When I run the deviceQuery from the SDK I get a runtime and driver version of 4.0. However, when using DriverGetVersion and RuntimeGetVersion in a test code compiled with pgfortran I get a driver version of 4.0 and a runtime of 3.2. I am using 11.6. Is this normal behavior? I am a bit confused of where this 3.2 runtime version is coming from.
Thanks in advanced.

Nevermind my question, I found the -ta:nvidia flag to choose runtime. Thanks anyway :)

Hi Mike,

But, how to use the peer-to-peer communication? Should i make any changes in my program in order to use several GPUs without arranging OpenMP communication?

Like Asynchronous data copies, these isn’t a natural way in the Fortran syntax to allow for this, so you’ll need to use the CUDA ABI directly.

While I hadn’t used this feature until now, I put together an example Vector Add program. It first enables the Peer To Peer communication, copies two arrays to device 0, copies the arrays from device 0 to device 1, performs the vector add on device 1, and copies the data back to the host. We haven’t added “cudaMemcpyDefault”, “cudaDeviceCanAccessPeer”, “cudaDeviceEnablePeerAccess”, and “cudaDeviceDisablePeerAccess” to the “cudafor” module yet, so you need to add the interfaces for now. I believe I’m correct, but there could be bugs in my code, so caveat emprot. Hopefully it will steer you in the right direction.

  • Mat
module vadd_mod
    use cudafor
    use iso_c_binding

    integer(c_int), parameter :: cudaMemcpyDefault=4
    
    interface
       function cudaDeviceCanAccessPeer(canAccessPeer, devnum, peerDev) &
	   bind(c, name='cudaDeviceCanAccessPeer')
	
	   use iso_c_binding
	   integer(c_int) :: canAccessPeer
	   integer(c_int), value :: devnum
	   integer(c_int), value :: peerDev
	   integer(c_int) :: cudaCanAccessPeer
       end function cudaDeviceCanAccessPeer
	
       function cudaDeviceEnablePeerAccess(peerDev, flags) &
	   bind(c, name='cudaDeviceEnablePeerAccess')

	   use iso_c_binding
	   integer(c_int), value :: peerDev
	   integer(c_int), value :: flags
	   integer(c_int) :: cudaDeviceEnablePeerAccess
       end function cudaDeviceEnablePeerAccess
    end interface	   


contains
    attributes(global) subroutine vadd_kernel(A,B,C,N)
	implicit none
	real(4), device :: A(N), B(N), C(N)
	integer, value :: N
	integer :: i
	i = (blockidx%x-1)*blockdim%x + threadidx%x
	if( i <= N ) C(i) = A(i) + B(I)
    end subroutine

    subroutine vadd( A, B, C )
	implicit none
	real(4), dimension(:) :: A, B, C
	real(4), device, allocatable, dimension(:):: Ad0, Bd0
	real(4), device, allocatable, dimension(:):: Ad1, Bd1, Cd1
	integer :: N, istat, dev0, dev1, canaccess
	N = size( A, 1 )

	dev0=0
        dev1=1	

        ! Determine if Peer to Peer communication is supported
        istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev1)
	if (canaccess) then
          ! set-up the communication between the devices, use cudaSetDevice to
          ! swtich context between the devices
          istat = cudaSetDevice(dev0)
          istat = cudaDeviceEnablePeerAccess(dev1,dev0)
          istat = cudaSetDevice(dev1)
          istat = cudaDeviceEnablePeerAccess(dev0,dev0)
	  
          !  allocate device 1's arrays
	  allocate( Ad1(N), Bd1(N), Cd1(N))

          ! allocate device 0's arrays
          istat = cudaSetDevice(dev0)
  	  allocate( Ad0(N), Bd0(N))

          ! copy host data to device 0
	  Ad0 = A(1:N)
	  Bd0 = B(1:N)

          ! copy device 0 data to device 1
	  istat = cudaMemcpy(Ad1,Ad0,N,cudaMemcpyDefault)
	  istat = cudaMemcpy(Bd1,Bd0,N,cudaMemcpyDefault)
	
          ! launch the kernel on device 1
          istat = cudaSetDevice(dev1)
	  call vadd_kernel<<< (N+63)/64, 64 >>>( Ad1, Bd1, Cd1, N )
    
          ! get the results
	  C(1:N) = Cd1

          ! deallocate the arrays on device 1
	  deallocate( Ad1, Bd1, Cd1 )

          ! deallocate the arrays on device 0
	  istat = cudaSetDevice(dev0)
	  deallocate( Ad0, Bd0)

        endif
    end subroutine
end module

program p
    use vadd_mod
    use cudafor
    implicit none
    integer :: nargs, n, i, ierr
    real, allocatable, dimension(:) :: a, b, c
    integer :: dt1(8), dt2(8), t1, t2
    external iargc
    integer iargc
    real :: rt
    character*10 arg

    nargs = iargc()
    if( nargs == 0 )then
	print *, 'a.out size1'
	return
    endif
    if( nargs >= 1 )then
	call getarg( 1, arg )
	read(arg,'(i)') n
    endif

    allocate( a(n), b(n), c(n) )
    do i = 1,n 
	a(i) = i
	b(i) = 2*i
	c(i) = 0
    enddo

    call date_and_time( values=dt1 )
    call vadd( a, b, c )
    call date_and_time( values=dt2 )
    t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
    t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
    rt = (t2 - t1)
    rt = rt / 1000.
    write(*,20) rt
    20 format( 'time=', f7.4, ' seconds' )

    ierr = 0
    do i = 1,n
	if( c(i) .ne. a(i) + b(i) )then
	    ierr = ierr + 1
	    if( ierr <= 10 )then
		print 30, i, c(i), a(i)+b(I)
		30 format( 'c(',i,') = ', f12.5, ' should be ', f12.5 )
	    endif
	endif
    enddo
    if( ierr == 0 )then
	print *, 'no errors'
    else
	print *,  ierr, ' errors found'
    endif
end program



% pgf90 -Mcuda=cuda4.0,cc20 ptp2.cuf -V11.6 -o ptp2.out
% ptp2.out 1024
time= 0.2130 seconds
 no errors

Hi papitas212,

However, when using DriverGetVersion and RuntimeGetVersion in a test code compiled with pgfortran I get a driver version of 4.0 and a runtime of 3.2. I am using 11.6. Is this normal behavior? I am a bit confused of where this 3.2 runtime version is coming from.
Thanks in advanced.

For compatibility reasons, the 11.6 compilers default to use the CUDA 3.2 runtime. To use CUDA 4.0, use the “-Mcuda=cuda4.0” flag for CUDA Fortran or “-ta=nvidia,cuda4.0” for the PGI Accelerator Model.

Hope this helps,
Mat

Am i right that initially we copy all arrays from host to device, then from device to device and finally execute VADD on the device 1, meanwhile device 0 doesn’t do anything?

Am i right that initially we copy all arrays from host to device, then from device to device and finally execute VADD on the device 1, meanwhile device 0 doesn’t do anything?

Of course, device 0 should launch it own kernel to do some useful work, but in this example I more just wanted to show how to set-up peer-to-peer communication and copy data between devices.

  • Mat

Great! Thanks Mat!

Is it possible to use OpenMP in order to proceed multiple device-to-device copyings simultaneously?

Is it possible to use OpenMP in order to proceed multiple device-to-device copyings simultaneously?

Each OpenMP thread would have a unique context, so while I haven’t tried it, I would assume the answer is yes you can do multiple unrelated device-to-device transfers but no you can’t use OpenMP to interleave peer-to-peer data transfers within the same context.

For a single context, I would investigate the use of Streams and Asynchronous data transfers. See: http://www.pgroup.com/lit/articles/insider/v3n1a4.htm. Granted, I haven’t tired it myself so I’m not positive if cudaMemcpyAsync has been overloaded to support peer-to-peer transfers like cudaMemcpy or if you need to use cudaMemcpyPeerAsync directly. It may take a bit of experimentation but should work.

Note that I talked with our Compiler Engineering Manager today. He’s in the process of adding to the cudafor module the interfaces for the 22 new CUDA 4.0 routines and will have them available in July’s 11.7 release. So you having to add explicit interfaces for them will be short lived.

  • Mat

Mat, after compiling and running the code as is I got the “C(i) = 0.0, should be …” error meaning the variables were not being copied from card to card.
In an attempt to debug, I changed the cudaMemcpyDefault to cudaMemcpyDevicetoDevice in the cudaMemcpy which brought my linux box to its knees! It completely froze the whole thing… twice! Haha… I am running Fedora 14 and since it came with gcc 4.5, I installed gcc 4.4.6 in /opt so it would be to CUDA’s liking (I add /opt to the front of my PATH before running anything CUDA related). What do you think are the reasons for both/either of my issues?
Thanks Mat!

Hi papitas212,

What kind of GPU do you have? The output from pgaccelinfo will tell you if you don’t know.

You need a Fermi card in order to perform Peer to Peer transfers so the code may be skipping the computation since it’s guarded by the “if (canaccess)” statement. I should have and an else clause if canaccess is false and printed out an error. Though, this would not explain why changing cudaMemcpyDefault to cudaMemcpyDevicetoDevice hosed your system.

cudaMemcpyDevicetoDevice is to copy data on the same device and wont work for peer to peer. You could try using cudaMemcpyPeer instead.

Other then these two things, I’m not sure what the problem would be. Peer to peer is new for me as well so some experimentation may be needed.

  • Mat

Thanks for your reply Mat.
I have 2 C2050s. I actually checked that canaccess was returning 1 and that was the case. Oh, I see about DevicetoDevice… the name is misleading! So then I assume that cudaMemcpyPeer=4 as the type of copy? The strange thing is that the memory is being allocated fine on both cards (I check after each allocation by copying back to host and printing) but just the copy doesn’t go and doesn’t produce any visible error.
Thanks for your help again Mat.
-Andres

Is it possible to copy just a part of an array from the device 0 to device 1, while another part of array on device 1 remains unchanged?

Hi Mike,

What you need to do is write a kernel that gathers the data you want to copy into a contiguous array, transfer the array, and then have another kernel on the destination device scatter the data back out.

Unfortunately, I don’t have an example just yet. For my next PGInsider article article, I’m planning on writing about multi-gpu programing and will be passing halo regions between GPUs. Using gathers and scatters will be my plan of attack, but I just haven’t had time to write the code.

  • Mat

Hi Mat,

Thank you for helpful answers. I found that Windows version of PGI fortran does not like the “cudaMemcpyDefault” command, is doesn’t copy any arrays from device to device, but everything works if you simply replace it with “cudaMemcpyDevicetoDevice”.

Thank you very much for your help :)

Another thing that i would like to point is that Simple p2p test provided by Nvidia says that GTX 590 doesn’t support p2p, meanwhile your p2p code works.

Dear Mat,

I have a problem with P2P communiation between GTX460 and GTX590. Here is your code with couple additional lines:

module vadd_mod
    use cudafor
    use iso_c_binding

    integer(c_int), parameter :: cudaMemcpyDefault=4
   
    interface
       function cudaDeviceCanAccessPeer(canAccessPeer, devnum, peerDev) &
      bind(c, name='cudaDeviceCanAccessPeer')
   
      use iso_c_binding
      integer(c_int) :: canAccessPeer
      integer(c_int), value :: devnum
      integer(c_int), value :: peerDev
      integer(c_int) :: cudaCanAccessPeer
       end function cudaDeviceCanAccessPeer
   
       function cudaDeviceEnablePeerAccess(peerDev, flags) &
      bind(c, name='cudaDeviceEnablePeerAccess')

      use iso_c_binding
      integer(c_int), value :: peerDev
      integer(c_int), value :: flags
      integer(c_int) :: cudaDeviceEnablePeerAccess
       end function cudaDeviceEnablePeerAccess
    end interface      


contains
    attributes(global) subroutine vadd_kernel(A,B,C,N)
   implicit none
   real(4), device :: A(N), B(N), C(N)
   integer, value :: N
   integer :: i
   i = (blockidx%x-1)*blockdim%x + threadidx%x
   if( i <= N ) C(i) = A(i) + B(I)
    end subroutine

    subroutine vadd( A, B, C ,N)
   implicit none
   real(4):: A(N), B(N), C(N)
   real(4), device, allocatable, dimension(:):: Ad0, Bd0, Cd0
   real(4), device, allocatable, dimension(:):: Ad1, Bd1, Cd1
   real(4), device, allocatable, dimension(:):: Ad2, Bd2, Cd2
   integer :: N, istat, dev0, dev1,dev2, canaccess,Ndev,i
       integer :: dt1(8), dt2(8), t1, t2
    real :: rt
  type (cudaEvent) :: startEvent, stopEvent, dummyEvent
  real :: time
	type(cudadeviceprop):: prop
	open(1,file='Info.dat')
	Ndev=3
	do i=0,Ndev-1
	istat=cudaGetDeviceProperties( prop, i)
	write(*,*)"The available CUDA capable device is ",trim(prop%name)
	write(1,*)"The available CUDA capable device is ",trim(prop%name)
	enddo
	print *,''
   dev0=0
        dev1=1
        dev2=2   

        ! Determine if Peer to Peer communication is supported
        istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev1)
        if (canaccess) then
			print *,'P2P communication between device ',dev0,' device ',dev1,' is supported'
			write(1,*) 'P2P communication between device ',dev0,' device ',dev1,' is supported'
		endif
        istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev2)
        if (canaccess) then
			print *,'P2P communication between device ',dev0,' device ',dev2,' is supported'
			write(1,*) 'P2P communication between device ',dev0,' device ',dev2,' is supported'
		endif
        istat = cudaDeviceCanAccessPeer(canaccess,dev1,dev2)
        if (canaccess) then
			print *,'P2P communication between device ',dev1,' device ',dev2,' is supported'
			write(1,*)'P2P communication between device ',dev1,' device ',dev2,' is supported'
		endif
     close(1)
	if (canaccess) then
          ! set-up the communication between the devices, use cudaSetDevice to
          ! swtich context between the devices
          
          istat = cudaSetDevice(dev0)
          istat = cudaDeviceEnablePeerAccess(dev1,dev0)
          istat = cudaSetDevice(dev1)
          istat = cudaDeviceEnablePeerAccess(dev0,dev0)
    
          !  allocate device 1's arrays
     allocate( Ad1(N), Bd1(N), Cd1(N))

          ! allocate device 0's arrays
          istat = cudaSetDevice(dev0)
       allocate( Ad0(N), Bd0(N), Cd0(N))

          ! copy host data to device 0
     Ad0 = A(1:N)
     Bd0 = B(1:N)

          ! copy device 0 data to device 1
     istat = cudaMemcpy(Ad1,Ad0,N,cudaMemcpyDevicetoDevice)
     istat = cudaMemcpy(Bd1,Bd0,N,cudaMemcpyDevicetoDevice)
 
          ! launch the kernel on device 1
          istat = cudaSetDevice(dev1)
     call vadd_kernel<<<N>>>( Ad1, Bd1, Cd1, N )
     call date_and_time( values=dt1 )
	istat = cudaEventCreate(startEvent)
	istat = cudaEventCreate(stopEvent)  
	istat = cudaEventCreate(dummyEvent) 
     
     istat = cudaMemcpy(Bd0,Bd1,N,cudaMemcpyDevicetoDevice)
     istat = cudaMemcpy(Ad0,Ad1,N,cudaMemcpyDevicetoDevice)
	 istat = cudaMemcpy(Cd0,Cd1,N,cudaMemcpyDevicetoDevice)
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)
  write(*,*) 'Time for P2P transfer (ms): ', time
	 call date_and_time( values=dt2 )
     t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
     t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
	 rt = (t2 - t1)
     rt = rt / 1000.
     print *,t2,t1
     print *,rt
     print *, 'P2P speed is ',(3.E0*N*4.E0/(1024*1024))/rt, 'MB/sec'
          ! get the results
     call date_and_time( values=dt1 )
     C(1:N) = Cd0
     call date_and_time( values=dt2 )
		t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
		t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
	 rt = (t2 - t1)
     rt = rt / 1000.
	print *, 'From device speed is ',(N*4.E0/(1024*1024))/rt, 'MB/sec'
          ! deallocate the arrays on device 1
     deallocate( Ad1, Bd1, Cd1 )

          ! deallocate the arrays on device 0
     istat = cudaSetDevice(dev0)
     deallocate( Ad0, Bd0, Cd0)

        endif
    end subroutine
	end module

	program p
    use vadd_mod
    use cudafor
    implicit none
    integer :: nargs, n, i, ierr
    real, allocatable, dimension(:) :: a, b, c
    integer :: dt1(8), dt2(8), t1, t2
    real :: rt


	N=1024*2048

    allocate( a(n), b(n), c(n) )
    do i = 1,n
	a(i) = i
	b(i) = 2*i
	c(i) = 0
    enddo

    call date_and_time( values=dt1 )
    call vadd( a, b, c ,N)
    call date_and_time( values=dt2 )
    t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
    t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
    rt = (t2 - t1)
    rt = rt / 1000.
    write(*,20) rt
    20 format( 'time=', f7.4, ' seconds' )

    ierr = 0
    do i = 1,n
   if( c(i) .ne. a(i) + b(i) )then
       ierr = ierr + 1
       if( ierr <= 10 )then
      print 30, i, c(i), a(i)+b(I)
      30 format( 'c(',i,') = ', f12.5, ' should be ', f12.5 )
       endif
   endif
    enddo
    if( ierr == 0 )then
   print *, 'no errors'
    else
   print *,  ierr, ' errors found'
    endif
end program

This is what the program returns:
The available CUDA capable device is GeForce GTX 590
The available CUDA capable device is GeForce GTX 590
The available CUDA capable device is GeForce GTX 460
P2P communication between device 0 device 1
is supported

Thus, the program says that i can not establish a P2P connection between GTX460 and GTX590. What did i do wrong? Is it possible to solve this problem somehow?

The compiler’s version is 11.6

Hi Mike,

Are all of your motherboard slots PCIe? My best guess is that the third one is not. P2P won’t work across different IOH chip sets.

  • Mat