CUDA Fortran + OpenMP problem

Hi all,
I want to use two CPU threads to control two GPUs, one thread per GPU. Here is the test program:
/////////////////////////////////////////////////
PROGRAM TEST
USE cudafor
USE omp_lib
IMPLICIT NONE
INTEGER :: istat
INTEGER :: GPU_ID,GPU_NUM,dev
REAL(KIND=8) :: cc
REAL(KIND=8),allocatable,dimension(:,:),device :: a

call omp_set_num_threads(2)
!$omp parallel private(GPU_ID,dev)
GPU_ID=omp_get_thread_num()
istat=cudaSetDevice(GPU_ID)
istat=cudaGetDevice(dev)
print*,'No.',GPU_ID,'CPU thread control','No.',dev,'GPU'
allocate(a(2,2))
a(:,:)=GPU_ID
!$omp end parallel

print*,'------------------------'
istat=cudaSetDevice(1)
istat=cudaGetDevice(GPU_ID)
print*,'GPU_ID',GPU_ID
cc=a(1,1)
print*,cc

print*,'------------------------'
istat=cudaSetDevice(0)
istat=cudaGetDevice(GPU_ID)
print*,'GPU_ID',GPU_ID
cc=a(1,1)
print*,cc

END PROGRAM

//////////////////////////////////////////
After running this program, I find that the two printed values of “a(1,1)” are the same. That means the array “a” holds the same value no matter which GPU is selected. Why does this happen? Can I use allocate to create arrays with the same name on different GPUs?

Thanks

You have one copy of the array “a”, and a race condition in which two CPU threads both try to allocate it and then set it to a value.

If you want a copy of A for each thread, and each GPU, you must make them OMP private.

Hi carowin,

Because “a” is shared by the threads. Unfortunately we don’t support putting device variables in “threadprivate” (which would be the non-device way of fixing this), so you’ll either need to make “a” private within the parallel region, which limits its scope, or use multiple device arrays.
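As an illustration of the "multiple device arrays" option, here is a minimal sketch. It assumes at least two GPUs and a build along the lines of nvfortran -cuda -mp; the names a0 and a1 are made up for this example and are not from the original program.

```fortran
! Sketch only: one explicitly named device array per GPU.
! a0 and a1 are illustrative names; assumes at least two GPUs are present.
program two_arrays
  use cudafor
  use omp_lib
  implicit none
  integer :: istat, gpu_id
  real(kind=8) :: cc
  real(kind=8), allocatable, dimension(:,:), device :: a0, a1

  call omp_set_num_threads(2)
  !$omp parallel private(gpu_id, istat)
  gpu_id = omp_get_thread_num()
  istat = cudaSetDevice(gpu_id)
  ! Each thread allocates and fills only its own array,
  ! so no two threads touch the same allocatable.
  if (gpu_id == 0) then
     allocate(a0(2,2))
     a0 = 0.0d0
  else
     allocate(a1(2,2))
     a1 = 1.0d0
  end if
  !$omp end parallel

  ! Back on the host thread: select the owning device before reading.
  istat = cudaSetDevice(0)
  cc = a0(1,1)
  print *, 'a0(1,1) =', cc
  istat = cudaSetDevice(1)
  cc = a1(1,1)
  print *, 'a1(1,1) =', cc
end program two_arrays
```

This stays simple for two GPUs but gets clumsy as the device count grows, since every array needs one name per device.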

Note that in my experience it’s much easier to use MPI+X (where X is CUDA Fortran, OpenACC, or OpenMP with target offload) when doing multi-GPU programming. It’s very challenging to manage multiple discrete memories, and you end up having to manually decompose the problem. In MPI, domain decomposition is inherent in the model, and you then have a one-to-one relationship between a rank and a GPU. Plus, MPI allows arbitrary scaling of the number of GPUs, both within a single node and across multiple nodes.

-Mat

I agree with your 2nd statement. The first is not quite right. We do support OMP threadprivate, but you need to change your code a bit:

PROGRAM TEST
USE cudafor
USE omp_lib
IMPLICIT NONE
INTEGER :: istat
INTEGER :: GPU_ID,GPU_NUM,dev
REAL(KIND=8) :: cc
REAL(KIND=8),allocatable,dimension(:,:),device :: a
!$omp threadprivate(a)

call omp_set_num_threads(2)
!$omp parallel private(GPU_ID,dev)
GPU_ID=omp_get_thread_num()
istat=cudaSetDevice(GPU_ID)
istat=cudaGetDevice(dev)
print*,'No.',GPU_ID,'CPU thread control','No.',dev,'GPU'
allocate(a(2,2))
a(:,:)=GPU_ID
!$omp end parallel

print*,'------------------------'

!$omp parallel private(dev, cc)
istat=cudaGetDevice(dev)
cc=a(1,1)
print*,dev, cc
print*,'------------------------'
!$omp end parallel

end

It is far easier, and may be more flexible, to call an orphaned subroutine from your top-level OMP region and declare the device variables there, if they are short-lived. I don’t believe we support threadprivate device arrays at the module level, only declared at the same level as the OMP region, which is probably what Mat is referring to.
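A rough sketch of the orphaned-subroutine approach follows. The module and routine names (gpu_work_mod, gpu_work) are hypothetical, and the sketch assumes at least two GPUs and a build such as nvfortran -cuda -mp. The device array is a local variable of the subroutine, not a module-level variable, so each call made from the parallel region works on its own array.

```fortran
! Sketch only: gpu_work_mod and gpu_work are hypothetical names.
module gpu_work_mod
  use cudafor
  implicit none
contains
  subroutine gpu_work(gpu_id)
    integer, intent(in) :: gpu_id
    integer :: istat, dev
    real(kind=8) :: cc
    ! Local device array without SAVE: every call, and therefore
    ! every calling thread, gets its own instance.
    real(kind=8), allocatable, dimension(:,:), device :: a

    istat = cudaSetDevice(gpu_id)
    istat = cudaGetDevice(dev)
    allocate(a(2,2))            ! allocated on the device selected above
    a = real(gpu_id, kind=8)
    cc = a(1,1)                 ! device-to-host copy of one element
    print *, 'thread', gpu_id, 'device', dev, 'a(1,1) =', cc
    deallocate(a)
  end subroutine gpu_work
end module gpu_work_mod

program orphaned
  use omp_lib
  use gpu_work_mod
  implicit none
  call omp_set_num_threads(2)
  !$omp parallel
  ! The routine is called from, but defined outside, the parallel region.
  call gpu_work(omp_get_thread_num())
  !$omp end parallel
end program orphaned
```

The trade-off is lifetime: the array exists only for the duration of the call, which matches the "short-lived" case described above.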

My bad. It’s only device arrays in modules that can’t be put in “threadprivate”.

That works! But why does “cc” have to be assigned and printed inside an OMP parallel region? I find that if I do not use OMP, the value of “cc” does not change.

I wonder how MPI solves this problem if I want to use the same array name on different GPUs?

Sorry but I’m not understanding this question.

Are you asking why the two separate sections need to be combined into a single “omp parallel” region? If so, it’s because each thread has its own private copy of “a”, so if you want to access each copy, that needs to be done within a parallel region.

I wonder how MPI solves this problem if I want to use the same array name on different GPUs?

Each MPI rank is an independent process with its own address space. Hence, while the name of a variable may be the same for each rank, the addresses are completely different.
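To make that concrete, here is a minimal MPI + CUDA Fortran sketch of the test program. It is only an outline under some assumptions: all ranks run on a single node (so mod(rank, ndev) is a reasonable device choice; a multi-node run would normally use a node-local rank instead), the MPI installation provides the "mpi" module, and the build and launch look something like mpif90 -cuda followed by mpirun -np 2.

```fortran
! Sketch only: assumes all ranks share one node, so the global rank
! can be mapped to a device with mod(rank, ndev).
program mpi_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, ndev, dev, istat
  real(kind=8) :: cc
  ! Same name in every rank, but each rank is a separate process
  ! with its own copy of this variable.
  real(kind=8), allocatable, dimension(:,:), device :: a

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  istat = cudaGetDeviceCount(ndev)
  istat = cudaSetDevice(mod(rank, ndev))
  istat = cudaGetDevice(dev)

  allocate(a(2,2))              ! lives on this rank's selected device
  a = real(rank, kind=8)
  cc = a(1,1)
  print *, 'rank', rank, 'device', dev, 'a(1,1) =', cc

  deallocate(a)
  call MPI_Finalize(ierr)
end program mpi_gpu
```

No threadprivate or per-device naming is needed here, because nothing is shared between ranks in the first place.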

I will try to use MPI+CUDA. Thanks!
