Can PGI Fortran do the multi-GPU computing without by using OpenMP in one workstation as the CUDA Toolkit 4.0?
Hi addison827,
Once CUDA 4.0 is officially released (currently only a pre-release version is available) we will add it to PGI’s installation shortly after.
While I haven’t used CUDA 4.0, it’s my understanding that it will allow you to have a single context use multiple GPUs. You will need to manage all data but will be allowed copy data directly between devices. I believe that CUDA functions were simply overloaded to add this support so they should ‘just work’ with CUDA Fortran. However, the PGI Accelerator will need modifications in order to support multiple GPUs and may be awhile before this support is added.
Hope this helps,
Mat
According to the release news, the Fortran 11.6 supports CUDA 4.0. But, how to use the peer-to-peer communication? Should i make any changes in my program in order to use several GPUs without arranging OpenMP communication?
Being totally new at CUDA, let me ask a (maybe) stupid question. When I run the deviceQuery from the SDK I get a runtime and driver version of 4.0. However, when using DriverGetVersion and RuntimeGetVersion in a test code compiled with pgfortran I get a driver version of 4.0 and a runtime of 3.2. I am using 11.6. Is this normal behavior? I am a bit confused of where this 3.2 runtime version is coming from.
Thanks in advanced.
Nevermind my question, I found the -ta:nvidia flag to choose runtime. Thanks anyway :)
Hi Mike,
But, how to use the peer-to-peer communication? Should i make any changes in my program in order to use several GPUs without arranging OpenMP communication?
Like Asynchronous data copies, these isn’t a natural way in the Fortran syntax to allow for this, so you’ll need to use the CUDA ABI directly.
While I hadn’t used this feature until now, I put together an example Vector Add program. It first enables the Peer To Peer communication, copies two arrays to device 0, copies the arrays from device 0 to device 1, performs the vector add on device 1, and copies the data back to the host. We haven’t added “cudaMemcpyDefault”, “cudaDeviceCanAccessPeer”, “cudaDeviceEnablePeerAccess”, and “cudaDeviceDisablePeerAccess” to the “cudafor” module yet, so you need to add the interfaces for now. I believe I’m correct, but there could be bugs in my code, so caveat emprot. Hopefully it will steer you in the right direction.
- Mat
module vadd_mod
use cudafor
use iso_c_binding
integer(c_int), parameter :: cudaMemcpyDefault=4
interface
function cudaDeviceCanAccessPeer(canAccessPeer, devnum, peerDev) &
bind(c, name='cudaDeviceCanAccessPeer')
use iso_c_binding
integer(c_int) :: canAccessPeer
integer(c_int), value :: devnum
integer(c_int), value :: peerDev
integer(c_int) :: cudaCanAccessPeer
end function cudaDeviceCanAccessPeer
function cudaDeviceEnablePeerAccess(peerDev, flags) &
bind(c, name='cudaDeviceEnablePeerAccess')
use iso_c_binding
integer(c_int), value :: peerDev
integer(c_int), value :: flags
integer(c_int) :: cudaDeviceEnablePeerAccess
end function cudaDeviceEnablePeerAccess
end interface
contains
attributes(global) subroutine vadd_kernel(A,B,C,N)
implicit none
real(4), device :: A(N), B(N), C(N)
integer, value :: N
integer :: i
i = (blockidx%x-1)*blockdim%x + threadidx%x
if( i <= N ) C(i) = A(i) + B(I)
end subroutine
subroutine vadd( A, B, C )
implicit none
real(4), dimension(:) :: A, B, C
real(4), device, allocatable, dimension(:):: Ad0, Bd0
real(4), device, allocatable, dimension(:):: Ad1, Bd1, Cd1
integer :: N, istat, dev0, dev1, canaccess
N = size( A, 1 )
dev0=0
dev1=1
! Determine if Peer to Peer communication is supported
istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev1)
if (canaccess) then
! set-up the communication between the devices, use cudaSetDevice to
! swtich context between the devices
istat = cudaSetDevice(dev0)
istat = cudaDeviceEnablePeerAccess(dev1,dev0)
istat = cudaSetDevice(dev1)
istat = cudaDeviceEnablePeerAccess(dev0,dev0)
! allocate device 1's arrays
allocate( Ad1(N), Bd1(N), Cd1(N))
! allocate device 0's arrays
istat = cudaSetDevice(dev0)
allocate( Ad0(N), Bd0(N))
! copy host data to device 0
Ad0 = A(1:N)
Bd0 = B(1:N)
! copy device 0 data to device 1
istat = cudaMemcpy(Ad1,Ad0,N,cudaMemcpyDefault)
istat = cudaMemcpy(Bd1,Bd0,N,cudaMemcpyDefault)
! launch the kernel on device 1
istat = cudaSetDevice(dev1)
call vadd_kernel<<< (N+63)/64, 64 >>>( Ad1, Bd1, Cd1, N )
! get the results
C(1:N) = Cd1
! deallocate the arrays on device 1
deallocate( Ad1, Bd1, Cd1 )
! deallocate the arrays on device 0
istat = cudaSetDevice(dev0)
deallocate( Ad0, Bd0)
endif
end subroutine
end module
program p
use vadd_mod
use cudafor
implicit none
integer :: nargs, n, i, ierr
real, allocatable, dimension(:) :: a, b, c
integer :: dt1(8), dt2(8), t1, t2
external iargc
integer iargc
real :: rt
character*10 arg
nargs = iargc()
if( nargs == 0 )then
print *, 'a.out size1'
return
endif
if( nargs >= 1 )then
call getarg( 1, arg )
read(arg,'(i)') n
endif
allocate( a(n), b(n), c(n) )
do i = 1,n
a(i) = i
b(i) = 2*i
c(i) = 0
enddo
call date_and_time( values=dt1 )
call vadd( a, b, c )
call date_and_time( values=dt2 )
t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
rt = (t2 - t1)
rt = rt / 1000.
write(*,20) rt
20 format( 'time=', f7.4, ' seconds' )
ierr = 0
do i = 1,n
if( c(i) .ne. a(i) + b(i) )then
ierr = ierr + 1
if( ierr <= 10 )then
print 30, i, c(i), a(i)+b(I)
30 format( 'c(',i,') = ', f12.5, ' should be ', f12.5 )
endif
endif
enddo
if( ierr == 0 )then
print *, 'no errors'
else
print *, ierr, ' errors found'
endif
end program
% pgf90 -Mcuda=cuda4.0,cc20 ptp2.cuf -V11.6 -o ptp2.out
% ptp2.out 1024
time= 0.2130 seconds
no errors
Hi papitas212,
However, when using DriverGetVersion and RuntimeGetVersion in a test code compiled with pgfortran I get a driver version of 4.0 and a runtime of 3.2. I am using 11.6. Is this normal behavior? I am a bit confused of where this 3.2 runtime version is coming from.
Thanks in advanced.
For compatibility reasons, the 11.6 compilers default to use the CUDA 3.2 runtime. To use CUDA 4.0, use the “-Mcuda=cuda4.0” flag for CUDA Fortran or “-ta=nvidia,cuda4.0” for the PGI Accelerator Model.
Hope this helps,
Mat
Am i right that initially we copy all arrays from host to device, then from device to device and finally execute VADD on the device 1, meanwhile device 0 doesn’t do anything?
Am i right that initially we copy all arrays from host to device, then from device to device and finally execute VADD on the device 1, meanwhile device 0 doesn’t do anything?
Of course, device 0 should launch it own kernel to do some useful work, but in this example I more just wanted to show how to set-up peer-to-peer communication and copy data between devices.
- Mat
Great! Thanks Mat!
Is it possible to use OpenMP in order to proceed multiple device-to-device copyings simultaneously?
Is it possible to use OpenMP in order to proceed multiple device-to-device copyings simultaneously?
Each OpenMP thread would have a unique context, so while I haven’t tried it, I would assume the answer is yes you can do multiple unrelated device-to-device transfers but no you can’t use OpenMP to interleave peer-to-peer data transfers within the same context.
For a single context, I would investigate the use of Streams and Asynchronous data transfers. See: Account Login | PGI. Granted, I haven’t tired it myself so I’m not positive if cudaMemcpyAsync has been overloaded to support peer-to-peer transfers like cudaMemcpy or if you need to use cudaMemcpyPeerAsync directly. It may take a bit of experimentation but should work.
Note that I talked with our Compiler Engineering Manager today. He’s in the process of adding to the cudafor module the interfaces for the 22 new CUDA 4.0 routines and will have them available in July’s 11.7 release. So you having to add explicit interfaces for them will be short lived.
- Mat
Mat, after compiling and running the code as is I got the “C(i) = 0.0, should be …” error meaning the variables were not being copied from card to card.
In an attempt to debug, I changed the cudaMemcpyDefault to cudaMemcpyDevicetoDevice in the cudaMemcpy which brought my linux box to its knees! It completely froze the whole thing… twice! Haha… I am running Fedora 14 and since it came with gcc 4.5, I installed gcc 4.4.6 in /opt so it would be to CUDA’s liking (I add /opt to the front of my PATH before running anything CUDA related). What do you think are the reasons for both/either of my issues?
Thanks Mat!
Hi papitas212,
What kind of GPU do you have? The output from pgaccelinfo will tell you if you don’t know.
You need a Fermi card in order to perform Peer to Peer transfers so the code may be skipping the computation since it’s guarded by the “if (canaccess)” statement. I should have and an else clause if canaccess is false and printed out an error. Though, this would not explain why changing cudaMemcpyDefault to cudaMemcpyDevicetoDevice hosed your system.
cudaMemcpyDevicetoDevice is to copy data on the same device and wont work for peer to peer. You could try using cudaMemcpyPeer instead.
Other then these two things, I’m not sure what the problem would be. Peer to peer is new for me as well so some experimentation may be needed.
- Mat
Thanks for your reply Mat.
I have 2 C2050s. I actually checked that canaccess was returning 1 and that was the case. Oh, I see about DevicetoDevice… the name is misleading! So then I assume that cudaMemcpyPeer=4 as the type of copy? The strange thing is that the memory is being allocated fine on both cards (I check after each allocation by copying back to host and printing) but just the copy doesn’t go and doesn’t produce any visible error.
Thanks for your help again Mat.
-Andres
Is it possible to copy just a part of an array from the device 0 to device 1, while another part of array on device 1 remains unchanged?
Hi Mike,
What you need to do is write a kernel that gathers the data you want to copy into a contiguous array, transfer the array, and then have another kernel on the destination device scatter the data back out.
Unfortunately, I don’t have an example just yet. For my next PGInsider article article, I’m planning on writing about multi-gpu programing and will be passing halo regions between GPUs. Using gathers and scatters will be my plan of attack, but I just haven’t had time to write the code.
- Mat
Hi Mat,
Thank you for helpful answers. I found that Windows version of PGI fortran does not like the “cudaMemcpyDefault” command, is doesn’t copy any arrays from device to device, but everything works if you simply replace it with “cudaMemcpyDevicetoDevice”.
Thank you very much for your help :)
Another thing that i would like to point is that Simple p2p test provided by Nvidia says that GTX 590 doesn’t support p2p, meanwhile your p2p code works.
Dear Mat,
I have a problem with P2P communiation between GTX460 and GTX590. Here is your code with couple additional lines:
module vadd_mod
use cudafor
use iso_c_binding
integer(c_int), parameter :: cudaMemcpyDefault=4
interface
function cudaDeviceCanAccessPeer(canAccessPeer, devnum, peerDev) &
bind(c, name='cudaDeviceCanAccessPeer')
use iso_c_binding
integer(c_int) :: canAccessPeer
integer(c_int), value :: devnum
integer(c_int), value :: peerDev
integer(c_int) :: cudaCanAccessPeer
end function cudaDeviceCanAccessPeer
function cudaDeviceEnablePeerAccess(peerDev, flags) &
bind(c, name='cudaDeviceEnablePeerAccess')
use iso_c_binding
integer(c_int), value :: peerDev
integer(c_int), value :: flags
integer(c_int) :: cudaDeviceEnablePeerAccess
end function cudaDeviceEnablePeerAccess
end interface
contains
attributes(global) subroutine vadd_kernel(A,B,C,N)
implicit none
real(4), device :: A(N), B(N), C(N)
integer, value :: N
integer :: i
i = (blockidx%x-1)*blockdim%x + threadidx%x
if( i <= N ) C(i) = A(i) + B(I)
end subroutine
subroutine vadd( A, B, C ,N)
implicit none
real(4):: A(N), B(N), C(N)
real(4), device, allocatable, dimension(:):: Ad0, Bd0, Cd0
real(4), device, allocatable, dimension(:):: Ad1, Bd1, Cd1
real(4), device, allocatable, dimension(:):: Ad2, Bd2, Cd2
integer :: N, istat, dev0, dev1,dev2, canaccess,Ndev,i
integer :: dt1(8), dt2(8), t1, t2
real :: rt
type (cudaEvent) :: startEvent, stopEvent, dummyEvent
real :: time
type(cudadeviceprop):: prop
open(1,file='Info.dat')
Ndev=3
do i=0,Ndev-1
istat=cudaGetDeviceProperties( prop, i)
write(*,*)"The available CUDA capable device is ",trim(prop%name)
write(1,*)"The available CUDA capable device is ",trim(prop%name)
enddo
print *,''
dev0=0
dev1=1
dev2=2
! Determine if Peer to Peer communication is supported
istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev1)
if (canaccess) then
print *,'P2P communication between device ',dev0,' device ',dev1,' is supported'
write(1,*) 'P2P communication between device ',dev0,' device ',dev1,' is supported'
endif
istat = cudaDeviceCanAccessPeer(canaccess,dev0,dev2)
if (canaccess) then
print *,'P2P communication between device ',dev0,' device ',dev2,' is supported'
write(1,*) 'P2P communication between device ',dev0,' device ',dev2,' is supported'
endif
istat = cudaDeviceCanAccessPeer(canaccess,dev1,dev2)
if (canaccess) then
print *,'P2P communication between device ',dev1,' device ',dev2,' is supported'
write(1,*)'P2P communication between device ',dev1,' device ',dev2,' is supported'
endif
close(1)
if (canaccess) then
! set-up the communication between the devices, use cudaSetDevice to
! swtich context between the devices
istat = cudaSetDevice(dev0)
istat = cudaDeviceEnablePeerAccess(dev1,dev0)
istat = cudaSetDevice(dev1)
istat = cudaDeviceEnablePeerAccess(dev0,dev0)
! allocate device 1's arrays
allocate( Ad1(N), Bd1(N), Cd1(N))
! allocate device 0's arrays
istat = cudaSetDevice(dev0)
allocate( Ad0(N), Bd0(N), Cd0(N))
! copy host data to device 0
Ad0 = A(1:N)
Bd0 = B(1:N)
! copy device 0 data to device 1
istat = cudaMemcpy(Ad1,Ad0,N,cudaMemcpyDevicetoDevice)
istat = cudaMemcpy(Bd1,Bd0,N,cudaMemcpyDevicetoDevice)
! launch the kernel on device 1
istat = cudaSetDevice(dev1)
call vadd_kernel<<<N>>>( Ad1, Bd1, Cd1, N )
call date_and_time( values=dt1 )
istat = cudaEventCreate(startEvent)
istat = cudaEventCreate(stopEvent)
istat = cudaEventCreate(dummyEvent)
istat = cudaMemcpy(Bd0,Bd1,N,cudaMemcpyDevicetoDevice)
istat = cudaMemcpy(Ad0,Ad1,N,cudaMemcpyDevicetoDevice)
istat = cudaMemcpy(Cd0,Cd1,N,cudaMemcpyDevicetoDevice)
istat = cudaEventRecord(stopEvent, 0)
istat = cudaEventSynchronize(stopEvent)
istat = cudaEventElapsedTime(time, startEvent, stopEvent)
write(*,*) 'Time for P2P transfer (ms): ', time
call date_and_time( values=dt2 )
t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
rt = (t2 - t1)
rt = rt / 1000.
print *,t2,t1
print *,rt
print *, 'P2P speed is ',(3.E0*N*4.E0/(1024*1024))/rt, 'MB/sec'
! get the results
call date_and_time( values=dt1 )
C(1:N) = Cd0
call date_and_time( values=dt2 )
t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
rt = (t2 - t1)
rt = rt / 1000.
print *, 'From device speed is ',(N*4.E0/(1024*1024))/rt, 'MB/sec'
! deallocate the arrays on device 1
deallocate( Ad1, Bd1, Cd1 )
! deallocate the arrays on device 0
istat = cudaSetDevice(dev0)
deallocate( Ad0, Bd0, Cd0)
endif
end subroutine
end module
program p
use vadd_mod
use cudafor
implicit none
integer :: nargs, n, i, ierr
real, allocatable, dimension(:) :: a, b, c
integer :: dt1(8), dt2(8), t1, t2
real :: rt
N=1024*2048
allocate( a(n), b(n), c(n) )
do i = 1,n
a(i) = i
b(i) = 2*i
c(i) = 0
enddo
call date_and_time( values=dt1 )
call vadd( a, b, c ,N)
call date_and_time( values=dt2 )
t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
rt = (t2 - t1)
rt = rt / 1000.
write(*,20) rt
20 format( 'time=', f7.4, ' seconds' )
ierr = 0
do i = 1,n
if( c(i) .ne. a(i) + b(i) )then
ierr = ierr + 1
if( ierr <= 10 )then
print 30, i, c(i), a(i)+b(I)
30 format( 'c(',i,') = ', f12.5, ' should be ', f12.5 )
endif
endif
enddo
if( ierr == 0 )then
print *, 'no errors'
else
print *, ierr, ' errors found'
endif
end program
This is what the program returns:
The available CUDA capable device is GeForce GTX 590
The available CUDA capable device is GeForce GTX 590
The available CUDA capable device is GeForce GTX 460
P2P communication between device 0 device 1
is supported
Thus, the program says that i can not establish a P2P connection between GTX460 and GTX590. What did i do wrong? Is it possible to solve this problem somehow?
The compiler’s version is 11.6
Hi Mike,
Are all of your motherboard slots PCIe? My best guess is that the third one is not. P2P won’t work across different IOH chip sets.
- Mat