Problem declaring device types inside OpenMP parallel loop.

alfvenwave · April 8, 2010, 3:55pm

Hi. I’m having some trouble working out how to access multiple GPUs using openMP with the PGI fortran compiler “pgfortran”.

Here is my problem. There is a very simple test program “openmpcudatest.cuf” below.

If I compile with “pgfortran openmpcudatest.cuf”, I get the right answer (I have set OMP_NUM_THREADS=2):

ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 2 OK
Set Adev to 1 using the assigned GPU…
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000 2.000000 2.000000
2.000000 2.000000

Now, if I compile with “pgfortran openmptest.cuf -mp”, the code won’t compile:

PGF90-S-0188-Argument number 1 to cudakerneltest: type mismatch (ompcudatest.cuf: 36)
0 inform, 0 warnings, 1 severes, 0 fatal for ompcudatest

i.e. it looks like the PGI compiler doesn’t like the device variable Adev being decalred in the OMP pragma PRIVATE declaration. If I remove it, the code does indeed compile. However, when I then run the code, I get this:

ithread : 1 idev : 1
Try to copy HOST data to device…
ithread : 0 idev : 0
Try to copy HOST data to device…
Copy for loop cycle 1 OK
Set Adev to 1 using the assigned GPU…
copyin Memcpy FAILED:17

i.e. one of the threads (the second one) has failed to copy private host array A to its designated device.

All I want to be able to do is to treat each openMP thread as separate, with its own attached GPU device, then add up all the data at the end.

Please can anyone tell me how to declare device variables inside openMP parallelised loops, or otherwise how to get openMP and CUDA to play nicely with eachother using pgfortran?

Thanks,

Rob.

program ompcudatest

use cudafor
use cudaKernelTest

implicit none

integer :: i,ithread,idev,iflag
real :: A (10),B(10)
real, device :: Adev(10)
integer :: omp_get_thread_num

!$OMP PARALLEL PRIVATE(i,ithread,idev,Adev,A) SHARED(B)
!$OMP DO

do i=1,2
ithread = omp_get_thread_num()
iflag = cudaSetDevice(ithread)
iflag = cudaGetDevice(idev)

print*, 'ithread : ',ithread, 'idev : ',idev

print*, ‘Try to copy HOST data to device…’

Adev = A

print*, ‘Copy for loop cycle ‘,i,’ OK’

call cudaKernelTest( Adev )

! Retrieve device array back to host for thread ithread

A = Adev

! Sum up arrays to shared host array B:

B = B + A

enddo

!$OMP END DO
!$OMP END PARALLEL

! 2 threads initiated, B should be 10 values of 2.0:

print*,B

end

! A very simple CUDA kernel + wrapper:

module cudaKernelTest

use cudafor

implicit none

contains

attributes(global) subroutine cudaKernelTest_kernel( n, Adev )

implicit none

integer, value :: n
real :: Adev(n)

Adev = 1.0

end subroutine cudaKernelTest_kernel

! Kernel wrapper:

subroutine cudaKernelTest( Adev )

implicit none

real, device :: Adev(:)
integer, value :: n

n = size(Adev,1)

print*, ‘Set Adev to 1 using the assigned GPU…’

call cudaKernelTest_kernel<<<64,64>>>(n, Adev)

end subroutine cudaKernelTest

end module cudaKernelTest

MatColgrove · April 8, 2010, 7:02pm

Hi alfvenwave,

Currently OpenMP doesn’t support “device” variables in the private clause. To work around this, pass “A” to “cudaKernelTest” and then declare “Adev” in “cudaKernelTest”.

For example:

% cat openmp.cuf
! A very simple CUDA kernel + wrapper:

module cudaKernelTest
use cudafor

implicit none

contains

attributes(global) subroutine cudaKernelTest_kernel( n, Adev )

implicit none
integer, value :: n
real :: Adev(n)

Adev = 1.0

end subroutine cudaKernelTest_kernel

! Kernel wrapper:
subroutine cudaKernelTest( A )

implicit none

real :: A(10)
real, device :: Adev(10)
integer, value :: n

Adev = 1.0
n = size(Adev,1)
!print*, 'Set Adev to 1 using the assigned GPU....'
call cudaKernelTest_kernel<<<64,64>>>(n, Adev)
A=Adev

end subroutine cudaKernelTest
end module cudaKernelTest

program ompcudatest

use cudafor
use cudaKernelTest

implicit none

integer :: i,ithread,idev,iflag
real :: A (10),B(10)
integer :: omp_get_thread_num

!$OMP PARALLEL PRIVATE(i,ithread,idev,A) SHARED(B)

ithread = omp_get_thread_num()
iflag = cudaSetDevice(ithread)
iflag = cudaGetDevice(idev)
print*, 'ithread : ',ithread, 'idev : ',idev

!$OMP DO
do i=1,2

!print*, ithread, ' Copy for loop cycle ',i,' OK'

call cudaKernelTest( A )

! Sum up arrays to shared host array B:

B = B + A
enddo

!$OMP END DO
!$OMP END PARALLEL

! 2 threads initiated, B should be 10 values of 2.0:
print*,B
end
% setenv OMP_NUM_THREADS 2
% pgf90 openmp.cuf -mp
% a.out
 ithread :             1 idev :             1
 ithread :             0 idev :             0
    2.000000        2.000000        2.000000        2.000000
    2.000000        2.000000        2.000000        2.000000
    2.000000        2.000000
xps730:/tmp/qa%

Hope this helps,
Mat

alfvenwave · April 9, 2010, 1:17pm

Thanks mkcolg - that works…

MMB · April 12, 2010, 3:35pm

What am I missing in the above code. idev never appears on the right hand side and yet its value appears to be set! Would someone please explain!

Malcolm

MatColgrove · April 12, 2010, 5:45pm

Hi Malcom,

“idev” gets set in the call to cudaGetDevice.

iflag = cudaGetDevice(idev)

Mat

Topic		Replies	Views
OpenMP + CUDA Fortran issue Legacy PGI Compilers	1	3217	February 8, 2011
Device Memory in OpenMP Legacy PGI Compilers	5	4039	November 21, 2017
OpenMP Threadprivate in CUDA Fortran Legacy PGI Compilers	1	2207	February 28, 2013
Multi-dimension array allocation problem Legacy PGI Compilers	2	2392	November 30, 2017
CUDA + OpenMP oddity - looks like a compiler bug. Legacy PGI Compilers	6	12251	April 12, 2010
CUDA Fortran+Openmp problem Legacy PGI Compilers	9	1255	March 3, 2022
Run-time error for multi-gpu programming with openmp (pgfort Legacy PGI Compilers	11	11647	January 25, 2014
problem of cudaGetDevice for MultiGPUs CUDA Programming and Performance	2	6569	November 12, 2009
Is it possible to use both OpenMP + CUDA in PGI fortran ? Legacy PGI Compilers	4	7772	December 18, 2010
OpenACC usage inside OpenMP constructs Legacy PGI Compilers	6	3945	August 26, 2019

Problem declaring device types inside OpenMP parallel loop.

Related topics