Problem declaring device types inside OpenMP parallel loop.

Hi. I’m having some trouble working out how to access multiple GPUs using openMP with the PGI fortran compiler “pgfortran”.

Here is my problem. There is a very simple test program “openmpcudatest.cuf” below.

If I compile with “pgfortran openmpcudatest.cuf”, I get the right answer (I have set OMP_NUM_THREADS=2):

ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 2 OK
Set Adev to 1 using the assigned GPU…
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000

Now, if I compile with “pgfortran openmptest.cuf -mp”, the code won’t compile:

PGF90-S-0188-Argument number 1 to cudakerneltest: type mismatch (ompcudatest.cuf: 36)
0 inform, 0 warnings, 1 severes, 0 fatal for ompcudatest

i.e. it looks like the PGI compiler doesn’t like the device variable Adev being decalred in the OMP pragma PRIVATE declaration. If I remove it, the code does indeed compile. However, when I then run the code, I get this:

ithread : 1 idev : 1
Try to copy HOST data to device…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
copyin Memcpy FAILED:17

i.e. one of the threads (the second one) has failed to copy private host array A to its designated device.

All I want to be able to do is to treat each openMP thread as separate, with its own attached GPU device, then add up all the data at the end.

Please can anyone tell me how to declare device variables inside openMP parallelised loops, or otherwise how to get openMP and CUDA to play nicely with eachother using pgfortran?

Thanks,

Rob.




program ompcudatest

use cudafor
use cudaKernelTest

implicit none

integer :: i,ithread,idev,iflag
real :: A (10),B(10)
real, device :: Adev(10)
integer :: omp_get_thread_num

!$OMP PARALLEL PRIVATE(i,ithread,idev,Adev,A) SHARED(B)
!$OMP DO

do i=1,2
ithread = omp_get_thread_num()
iflag = cudaSetDevice(ithread)
iflag = cudaGetDevice(idev)

print*, 'ithread : ',ithread, 'idev : ',idev

print*, ‘Try to copy HOST data to device…’

Adev = A

print*, ‘Copy for loop cycle ‘,i,’ OK’

call cudaKernelTest( Adev )

! Retrieve device array back to host for thread ithread

A = Adev

! Sum up arrays to shared host array B:

B = B + A

enddo

!$OMP END DO
!$OMP END PARALLEL

! 2 threads initiated, B should be 10 values of 2.0:

print*,B

end



! A very simple CUDA kernel + wrapper:

module cudaKernelTest

use cudafor

implicit none

contains

attributes(global) subroutine cudaKernelTest_kernel( n, Adev )

implicit none

integer, value :: n
real :: Adev(n)

Adev = 1.0

end subroutine cudaKernelTest_kernel

! Kernel wrapper:

subroutine cudaKernelTest( Adev )

implicit none

real, device :: Adev(:)
integer, value :: n

n = size(Adev,1)

print*, ‘Set Adev to 1 using the assigned GPU…’

call cudaKernelTest_kernel<<<64,64>>>(n, Adev)

end subroutine cudaKernelTest

end module cudaKernelTest

Hi alfvenwave,

Currently OpenMP doesn’t support “device” variables in the private clause. To work around this, pass “A” to “cudaKernelTest” and then declare “Adev” in “cudaKernelTest”.

For example:

% cat openmp.cuf
! A very simple CUDA kernel + wrapper:

module cudaKernelTest
use cudafor

implicit none

contains

attributes(global) subroutine cudaKernelTest_kernel( n, Adev )

implicit none
integer, value :: n
real :: Adev(n)

Adev = 1.0

end subroutine cudaKernelTest_kernel

! Kernel wrapper:
subroutine cudaKernelTest( A )

implicit none

real :: A(10)
real, device :: Adev(10)
integer, value :: n

Adev = 1.0
n = size(Adev,1)
!print*, 'Set Adev to 1 using the assigned GPU....'
call cudaKernelTest_kernel<<<64,64>>>(n, Adev)
A=Adev

end subroutine cudaKernelTest
end module cudaKernelTest

program ompcudatest

use cudafor
use cudaKernelTest

implicit none

integer :: i,ithread,idev,iflag
real :: A (10),B(10)
integer :: omp_get_thread_num

!$OMP PARALLEL PRIVATE(i,ithread,idev,A) SHARED(B)

ithread = omp_get_thread_num()
iflag = cudaSetDevice(ithread)
iflag = cudaGetDevice(idev)
print*, 'ithread : ',ithread, 'idev : ',idev

!$OMP DO
do i=1,2

!print*, ithread, ' Copy for loop cycle ',i,' OK'

call cudaKernelTest( A )

! Sum up arrays to shared host array B:

B = B + A
enddo

!$OMP END DO
!$OMP END PARALLEL

! 2 threads initiated, B should be 10 values of 2.0:
print*,B
end
% setenv OMP_NUM_THREADS 2
% pgf90 openmp.cuf -mp
% a.out
 ithread :             1 idev :             1
 ithread :             0 idev :             0
    2.000000        2.000000        2.000000        2.000000
    2.000000        2.000000        2.000000        2.000000
    2.000000        2.000000
xps730:/tmp/qa%

Hope this helps,
Mat

Thanks mkcolg - that works…

What am I missing in the above code. idev never appears on the right hand side and yet its value appears to be set! Would someone please explain!

Malcolm

Hi Malcom,

“idev” gets set in the call to cudaGetDevice.

iflag = cudaGetDevice(idev)
  • Mat