Hi. I’m having some trouble working out how to access multiple GPUs using openMP with the PGI fortran compiler “pgfortran”.
Here is my problem. There is a very simple test program “openmpcudatest.cuf” below.
If I compile with “pgfortran openmpcudatest.cuf”, I get the right answer (I have set OMP_NUM_THREADS=2):
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 2 OK
Set Adev to 1 using the assigned GPU…
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000
Now, if I compile with “pgfortran openmptest.cuf -mp”, the code won’t compile:
PGF90-S-0188-Argument number 1 to cudakerneltest: type mismatch (ompcudatest.cuf: 36)
0 inform, 0 warnings, 1 severes, 0 fatal for ompcudatest
i.e. it looks like the PGI compiler doesn’t like the device variable Adev being decalred in the OMP pragma PRIVATE declaration. If I remove it, the code does indeed compile. However, when I then run the code, I get this:
ithread : 1 idev : 1
Try to copy HOST data to device…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
copyin Memcpy FAILED:17
i.e. one of the threads (the second one) has failed to copy private host array A to its designated device.
All I want to be able to do is to treat each openMP thread as separate, with its own attached GPU device, then add up all the data at the end.
Please can anyone tell me how to declare device variables inside openMP parallelised loops, or otherwise how to get openMP and CUDA to play nicely with eachother using pgfortran?
Thanks,
Rob.
program ompcudatest
use cudafor
use cudaKernelTest
implicit none
integer :: i,ithread,idev,iflag
real :: A (10),B(10)
real, device :: Adev(10)
integer :: omp_get_thread_num
!$OMP PARALLEL PRIVATE(i,ithread,idev,Adev,A) SHARED(B)
!$OMP DO
do i=1,2
ithread = omp_get_thread_num()
iflag = cudaSetDevice(ithread)
iflag = cudaGetDevice(idev)
print*, 'ithread : ',ithread, 'idev : ',idev
print*, ‘Try to copy HOST data to device…’
Adev = A
print*, ‘Copy for loop cycle ‘,i,’ OK’
call cudaKernelTest( Adev )
! Retrieve device array back to host for thread ithread
A = Adev
! Sum up arrays to shared host array B:
B = B + A
enddo
!$OMP END DO
!$OMP END PARALLEL
! 2 threads initiated, B should be 10 values of 2.0:
print*,B
end
! A very simple CUDA kernel + wrapper:
module cudaKernelTest
use cudafor
implicit none
contains
attributes(global) subroutine cudaKernelTest_kernel( n, Adev )
implicit none
integer, value :: n
real :: Adev(n)
Adev = 1.0
end subroutine cudaKernelTest_kernel
! Kernel wrapper:
subroutine cudaKernelTest( Adev )
implicit none
real, device :: Adev(:)
integer, value :: n
n = size(Adev,1)
print*, ‘Set Adev to 1 using the assigned GPU…’
call cudaKernelTest_kernel<<<64,64>>>(n, Adev)
end subroutine cudaKernelTest
end module cudaKernelTest